
What Can the Millions of Random Treatments in Nonexperimental Data Reveal About Causes?

  • Original Research
  • Published in SN Computer Science

Abstract

We propose a new method to estimate causal effects from nonexperimental data. Each pair of sample units is first associated with a stochastic ‘treatment’—differences in factors between units—and an effect—a resultant outcome difference. It is then proposed that all pairs can be combined to provide more accurate estimates of causal effects in nonexperimental data, given a statistical model relating combinatorial properties of treatments to the accuracy and unbiasedness of their effects. The article introduces one such model and a Bayesian approach to combine the \(O(n^2)\) pairwise observations typically available in nonexperimental data. This also leads to an interpretation of nonexperimental datasets as incomplete, or noisy, versions of ideal factorial experimental designs. This approach to causal effect estimation has several advantages: (1) it expands the number of observations, converting thousands of individuals into millions of observational treatments; (2) starting with treatments closest to the experimental ideal, it identifies noncausal variables that can be ignored in the future, making estimation easier in each subsequent iteration while departing minimally from experiment-like conditions; (3) it recovers individual causal effects in heterogeneous populations. We evaluate the method in simulations and in the National Supported Work (NSW) program, an intensively studied program whose effects are known from randomized field experiments. We demonstrate that the proposed approach recovers causal effects in common NSW samples, as well as in arbitrary subpopulations and in an order-of-magnitude larger supersample comprising the entire national program data, outperforming Statistical, Econometric, and Machine Learning estimators in all cases. As a tool, the approach also allows researchers to represent and visualize possible causes, and heterogeneous subpopulations, in their samples.


Notes

  1. The assumption that \({\mathcal {X}}^m\) includes all factors associated with both treatment assignment and outcomes [17, 18].

  2. We use the typical likelihood notation \({\mathcal {N}}[ y \mid \mu , \sigma ^2]\) for the likelihood of an observation y given a mean \(\mu\) and variance \(\sigma ^2\).

  3. See  [10] for further algorithmic details.

  4. For short, we use x to refer to both Boolean vectors and set variables (i.e., the set of variables with value +1).

  5. This function is often called a rectifier and is currently the most popular activation function in deep neural networks.

  6. We use the typical likelihood notation \({\mathcal {N}}[ y \mid \mu , \sigma ^2]\) for an observation y with mean \(\mu\) and variance \(\sigma ^2\).

  7. That is, approximately 100K–200M pairwise treatments.

  8. Recall that, in standardized datasets, the cosine of the angle between a pair of vectors corresponds to their Pearson correlation.

References

  1. Pearl J. The seven tools of causal inference, with reflections on machine learning. Commun ACM. 2019;62(3):54–60.

  2. Athey S. Beyond prediction: using big data for policy problems. Science. 2017;355(6324):483–5. https://doi.org/10.1126/science.aal4321.

  3. Imbens GW. Better late than nothing: some comments on Deaton (2009) and Heckman and Urzua (2009). J Econ Lit. 2010;48(2):399–423. https://doi.org/10.1257/jel.48.2.399.

  4. Duflo E, Glennerster R, Kremer M. Using randomization in development economics research: a toolkit. In: Schultz TP, Strauss JA, editors. Handbook of development economics, vol. 4. Elsevier; 2008. p. 3895–962 (Chap. 61). https://ideas.repec.org/h/eee/devchp/5-61.html.

  5. Deaton A. Instruments, randomization, and learning about development. J Econ Lit. 2010;48(2):424–55. https://doi.org/10.1257/jel.48.2.424.

  6. Heckman JJ, Smith JA. Assessing the case for social experiments. J Econ Perspect. 1995;9(2):85–110. https://doi.org/10.1257/jep.9.2.85.

  7. Xie Y. Population heterogeneity and causal inference. Proc Natl Acad Sci. 2013;110(16):6262. https://doi.org/10.1073/pnas.1303102110.

  8. Morgan SL. Counterfactuals and causal inference: methods and principles for social research. New York; 2007.

  9. Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Sci. 2010;25(1).

  10. Colson KE, Rudolph KE, Zimmerman SC, Goin DE, Stuart EA, van der Laan M, Ahern J. Optimizing matching and analysis combinations for estimating causal effects. Sci Rep. 2016;6(1). https://doi.org/10.1038/srep23222.

  11. MacKay DJC. Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press; 2003.

  12. Hall DL, Liggins ME, Llinas J. Handbook of multisensor data fusion: theory and practice. Boca Raton: CRC Press; 2008.

  13. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55. https://doi.org/10.1093/biomet/70.1.41.

  14. Reichenbach H. The direction of time. Berkeley: University of California Press; 1956.

  15. Suter R, Miladinovic D, Schölkopf B, Bauer S. Robustly disentangled causal mechanisms: validating deep representations for interventional robustness. In: Chaudhuri K, Salakhutdinov R, editors. Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 97. PMLR; 2019. p. 6056–65. http://proceedings.mlr.press/v97/suter19a.html.

  16. Schölkopf B, Locatello F, Bauer S, Ke NR, Kalchbrenner N, Goyal A, Bengio Y. Toward causal representation learning. Proc IEEE. 2021;109(5):612–34. https://doi.org/10.1109/JPROC.2021.3058954.

  17. Heckman JJ, Ichimura H, Todd P. Matching as an econometric evaluation estimator. Rev Econ Stud. 1998;65(2):261–94. https://doi.org/10.1111/1467-937X.00044.

  18. Shadish WR, Clark MH, Steiner PM. Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. J Am Stat Assoc. 2008;103(484):1334–44.

  19. Wang M, Zhao Y, Zhang B. Efficient test and visualization of multi-set intersections. Sci Rep. 2015;5(1):16923. https://doi.org/10.1038/srep16923.

  20. Louizos C, Shalit U, Mooij JM, Sontag D, Zemel R, Welling M. Causal effect inference with deep latent-variable models. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, editors. Advances in neural information processing systems 30. Curran Associates; 2017. p. 6446–56. http://papers.nips.cc/paper/7223-causal-effect-inference-with-deep-latent-variable-models.pdf.

  21. Wang Y, Blei DM. The blessings of multiple causes. J Am Stat Assoc. 2020;114(528):1574–96. https://doi.org/10.1080/01621459.2019.1686987.

  22. Abadie A, Diamond A, Hainmueller J. Comparative politics and the synthetic control method. Am J Polit Sci. 2015;59(2):495–510. https://doi.org/10.1111/ajps.12116.

  23. Ribeiro A. An experimental-design perspective on population genetic variation. Proc Natl Acad Sci (under review). 2020.

  24. LaLonde RJ. Evaluating the econometric evaluations of training programs with experimental data. Am Econ Rev. 1986;76(4):604–20.

  25. Angrist JD. Mostly harmless econometrics: an empiricist’s companion. Princeton: Princeton University Press; 2009.

  26. Smith JA, Todd PE. Does matching overcome LaLonde’s critique of nonexperimental estimators? J Econom. 2005;125(1):305–53. https://doi.org/10.1016/j.jeconom.2004.04.011.

  27. Dehejia R, Wahba S. Causal effects in nonexperimental studies: reevaluating the evaluation of training programs. J Am Stat Assoc. 1999;94:1053.

  28. Zhao Z. Matching estimators and the data from the National Supported Work Demonstration again. Bonn, Germany; 2006.

  29. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688–701. https://doi.org/10.1037/h0037350.

  30. Fisher R. Arrangement of field experiments. Agric J India. 1927;22.

  31. Dasgupta T, Pillai NS, Rubin DB. Causal inference from 2^K factorial designs by using potential outcomes. J R Stat Soc Ser B (Stat Methodol). 2015;77(4):727–53.

  32. Pearl J. The foundations of causal inference. Sociol Methodol. 2010;40(1):75–149. https://doi.org/10.1111/j.1467-9531.2010.01228.x.

  33. King G, Nielsen R. Why propensity scores should not be used for matching. Polit Anal. 2019;27(4):435–54. https://doi.org/10.1017/pan.2019.11.

  34. Imai K, King G, Stuart EA. Misunderstandings between experimentalists and observationalists about causal inference. J R Stat Soc Ser A. 2008;171(2):481–502.

  35. Pearl J. Causality: models, reasoning, and inference. Cambridge: Cambridge University Press; 2000.

  36. Stuart EA, Lee BK, Leacy FP. Prognostic score-based balance measures can be a useful diagnostic for propensity score methods in comparative effectiveness research. J Clin Epidemiol. 2013;66(8):S84–90. https://doi.org/10.1016/j.jclinepi.2013.01.013.

  37. Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat Med. 2015;34(28):3661–79. https://doi.org/10.1002/sim.6607.

  38. Belitser SV, Martens EP, Pestman WR, Groenwold RHH, de Boer A, Klungel OH. Measuring balance and model selection in propensity score methods. Pharmacoepidemiol Drug Saf. 2011;20(11):1115–29. https://doi.org/10.1002/pds.2188.

  39. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083–107. https://doi.org/10.1002/sim.3697.

  40. McCaffrey DF, Ridgeway G, Morral AR. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol Methods. 2004;9(4):403–25. https://doi.org/10.1037/1082-989X.9.4.403.

  41. Diamond A, Sekhon JS. Genetic matching for estimating causal effects: a general multivariate matching method for achieving balance in observational studies. Rev Econ Stat. 2012;95(3):932–45.

  42. Hansen BB. The prognostic analogue of the propensity score. Biometrika. 2008;95(2):481–8. https://doi.org/10.1093/biomet/asn004.

  43. Franklin JM, Rassen JA, Ackermann D, Bartels DB, Schneeweiss S. Metrics for covariate balance in cohort studies of causal effects. Stat Med. 2014;33(10):1685–99. https://doi.org/10.1002/sim.6058.

  44. Iacus SM, King G, Porro G. Causal inference without balance checking: coarsened exact matching. Polit Anal. 2012;20(1):1–24. https://doi.org/10.1093/pan/mpr013.

  45. Huling JD, Mak S. Energy balancing of covariate distributions; 2020.

  46. Sejdinovic D, Sriperumbudur B, Gretton A, Fukumizu K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann Stat. 2013;41(5):2263–91. https://doi.org/10.1214/13-AOS1140.

  47. Tarantola A. Inverse problem theory and methods for model parameter estimation. Philadelphia: Society for Industrial and Applied Mathematics; 2005.

  48. Hastie T. The elements of statistical learning: data mining, inference, and prediction. Springer Series in Statistics. New York: Springer; 2001.

  49. Athey S, Imbens G. Recursive partitioning for heterogeneous causal effects. Proc Natl Acad Sci. 2016;113(27):7353–60. https://doi.org/10.1073/pnas.1510489113.

  50. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6(1):25.

  51. Chatton A, Le Borgne F, Leyrat C, Gillaizeau F, Rousseau C, Barbin L, Laplaud D, Leger M, Giraudeau B, Foucher Y. G-computation, propensity score-based methods, and targeted maximum likelihood estimator for causal inference with different covariates sets: a comparative simulation study. Sci Rep. 2020;10(1):9219. https://doi.org/10.1038/s41598-020-65917-x.

  52. Xu S, Ross C, Raebel MA, Shetterly S, Blanchette C, Smith D. Use of stabilized inverse propensity scores as weights to directly estimate relative risk and its confidence intervals. Value Health. 2010;13(2):273–7. https://doi.org/10.1111/j.1524-4733.2009.00671.x.

  53. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550–60. https://doi.org/10.1097/00001648-200009000-00011.

  54. Aalen OO, Farewell VT, de Angelis D, Day NE, Nöel Gill O. A Markov model for HIV disease progression including the effect of HIV diagnosis and treatment: application to AIDS prediction in England and Wales. Stat Med. 1997;16(19):2191–210. https://doi.org/10.1002/(SICI)1097-0258(19971015)16:19<2191::AID-SIM645>3.0.CO.

  55. Luque-Fernandez MA, Schomaker M, Rachet B, Schnitzer ME. Targeted maximum likelihood estimation for a binary treatment: a tutorial. Stat Med. 2018;37(16):2530–46. https://doi.org/10.1002/sim.7628.

  56. Dehejia RH, Wahba S. Propensity score-matching methods for nonexperimental causal studies. Rev Econ Stat. 2002;84(1):151–61. https://doi.org/10.1162/003465302317331982.

  57. Furst M, Jackson J, Smith S. Improved learning of AC0 functions. In: Proceedings of the Fourth Annual Workshop on Computational Learning Theory; 1991. p. 317–25. http://search.proquest.com/docview/31297843/.

  58. Figueiredo M. Adaptive sparseness for supervised learning. IEEE Trans Pattern Anal Mach Intell. 2003;25(9):1150–9.

  59. Kiefer J, Wolfowitz J. Stochastic estimation of the maximum of a regression function. Ann Math Stat. 1952;23(3):462–6. https://doi.org/10.1214/aoms/1177729392.

  60. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, Hassabis D. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529. https://doi.org/10.1038/nature14236.

  61. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44. https://doi.org/10.1038/nature14539.

  62. Servedio RA. On learning monotone DNF under product distributions. Inf Comput. 2004;193(1):57–74.

  63. Bshouty N, Tamon C. On the Fourier spectrum of monotone functions. J ACM. 1996;43(4):747–70.


Author information

Corresponding author

Correspondence to Andre F. Ribeiro.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Objective Function

In this appendix, we derive Eq. (5) for binary observed variables. An extension to continuous variables is considered in Appendix 2.

Treatment Likelihood

Let’s first define the requirements for a pair of individuals (i, j) to represent a univariate treatment \(\{a\}\) with certainty, \(a \in {\mathcal {X}}^m\). The first requirement concerns the treated variable a itself.Footnote 4 It is that \(x_i(a) \cdot \lnot x_j(a)=\mathbbm {1}\), where \(\lnot\) and \(\cdot\) are the Boolean NOT and AND operators and \(\mathbbm {1}\) is an m-sized vector with all \(+1\) values. The second requirement concerns the other variables, \(b \ne a\): they must be either also treated, \(x_i(b) \cdot \lnot x_j(b)=\mathbbm {1}\), or common, \(x_i(b) \cdot x_j(b)=\mathbbm {1}\), between i and j.

With these requirements, we define individuals’ positions, \(x_i\), as random observations of factorial runs. The norms of the vectors, \(\vert x_i \vert\) and \(\vert x_j \vert\), relate to the likelihood of treatment, and the angles, \(\theta _{ij}\), to the likelihood of observing confounding conditions among pairs of individuals. Their dot-product, \(\langle x_i, x_j \rangle = \vert x_i \vert \vert x_j \vert \cos \theta _{ij}\), reflects both factors and becomes a key element in the optimization.
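For concreteness, the following minimal sketch (in Python, with hypothetical ±1-coded example vectors; not part of the article’s implementation) checks the two requirements above for a pair of individuals:

```python
import numpy as np

# A minimal sketch (not from the article): classify each coordinate of a pair
# (x_i, x_j) in the {-1, +1}^m encoding as 'treated' (present in i, absent in j),
# 'reverse-treated', or 'common', and check the univariate-treatment requirement.

def pair_coordinates(x_i, x_j):
    treated = (x_i == 1) & (x_j == -1)   # a in i but not in j
    reverse = (x_i == -1) & (x_j == 1)   # a in j but not in i
    common = x_i == x_j                  # a shared (or jointly absent)
    return treated, reverse, common

def is_univariate_treatment(x_i, x_j, a):
    """True if coordinate a is treated and every other coordinate is common or treated."""
    treated, reverse, _ = pair_coordinates(x_i, x_j)
    return bool(treated[a]) and not reverse.any()

x_i = np.array([+1, +1, -1, +1])
x_j = np.array([-1, +1, -1, +1])
print(is_univariate_treatment(x_i, x_j, a=0))  # True: only coordinate 0 differs
```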

Fig. 7: Example vector pair and the relationship among its dot-product, vector sum, and vector difference, as well as the functionals g(x) and h(x) and the pairwise outcome differences, \(y_{ij}\)

Before formulating this relationship in detail, reconsider Eq. (1). Let \(g(x) = \max (0,x)\), which sets nonpositive coordinates of x to zero.Footnote 5 We can decompose an individual pair into a difference vector, \(x_i{-}x_j\), and a sum vector, \(x_i{+}x_j\). Due to the sign convention, the first contains the treated coordinates and the second the non-treated coordinates. The dot-product relates the sum and difference vectors geometrically when \(g(x){=}h(x)\). According to Eq. (1), this corresponds to the assumption that the variance is proportional to the expected effect of the non-treated variables, i.e., the expected amount of confounding. The relationship leads to a general least-squares solution (considered in further detail below), in which we minimize a residual, \(y_{ij} - \vert g({x}_i{-}{x}_j) \vert\), and a penalty, \(\vert h({x}_i{+}{x}_j) \vert\). Notice that \(\vert g({x}_i{-}{x}_j) \vert\) is also a distance. Figure 7 sketches the (distance) residual and cost for an example pair. Letting \(x^{\prime } =g(x)\), the Law of Cosines leads to

$$\begin{aligned} \begin{aligned}&\Big (y_{ij} -\vert {x}^{\prime }_i - {x}^{\prime }_j \vert ^2 \Big ) + \vert {x}^{\prime }_i + {x}^{\prime }_j \vert ^2,\\&\quad = \Big ( y_{ij} - \vert {x}^{\prime }_i \vert ^2 - \vert {x}^{\prime }_j \vert ^2 + 2 \langle {x}^{\prime }_i, {x}^{\prime }_j \rangle \Big ) + \vert {x}^{\prime }_i \vert ^2 + \vert {x}^{\prime }_j \vert ^2 + 2 \langle {x}^{\prime }_i, {x}^{\prime }_j \rangle ,\\&\quad = y_{ij} + 4 \langle {x}^{\prime }_i, {x}^{\prime }_j \rangle ,\\ \end{aligned} \end{aligned}$$
(7)
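The identity in Eq. (7) can be checked numerically; the snippet below (a sketch with arbitrary vectors standing in for \(g(x_i)\) and \(g(x_j)\)) verifies that the squared-norm residual and penalty reduce to \(y_{ij} + 4 \langle {x}^{\prime }_i, {x}^{\prime }_j \rangle\):

```python
import numpy as np

# A small numerical check of the identity behind Eq. (7): for any vectors u, v
# (standing in for g(x_i) and g(x_j)) and scalar y_ij,
#   (y_ij - |u - v|^2) + |u + v|^2 = y_ij + 4<u, v>.
rng = np.random.default_rng(0)
u, v = rng.normal(size=5), rng.normal(size=5)
y_ij = 2.0
lhs = (y_ij - np.sum((u - v) ** 2)) + np.sum((u + v) ** 2)
rhs = y_ij + 4 * np.dot(u, v)
print(np.isclose(lhs, rhs))  # True
```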
Fig. 8: Representative pairings of individuals (leftmost column), their related dot-product (middle column), and their treatments as Venn diagrams (rightmost column): a univariate individual-level treatment with no observed confounding risk; b multivariate treatment with no confounding risk; c univariate treatment with confounding risk; d no treatment

Let’s then define the probability \(p_y({x}_i,{x}_j)\) and its relation to the dot-product in further detail. We defined a variable \(a\in {\mathcal {X}}^m\) as being under a factorial treatment when the variable is treated and all other variables are either common or also treated. Figure 8a–d (third column) depicts these conditions as Venn diagrams for the cases in Fig. 1 (main article). The dot-product in the \(2^m\)-dimensional Boolean vector space  [57] has the interpretation

$$\begin{aligned} \langle z_i, z_j \rangle = \frac{1}{2^m} \sum _{a=1}^m z_i(a)z_j(a) = E_U[z_i\cdot z_j] \end{aligned}$$
(8)

where \(z \in \{\texttt {-}1,\texttt {+}1\}^m\) and the expectation is taken uniformly over all \(a \in {\mathcal {X}}\). The dot-product indicates the expected number of common variables between vectors. It also defines the \(\ell _2\)-norm \(\vert z \vert = \sqrt{\langle z, z \rangle } = \sqrt{E_U[z^2] }\). We consider, instead,

$$\begin{aligned} \langle z_i, \lnot z_j \rangle = E_U[z_i \cdot \lnot z_j], \end{aligned}$$
(9)

which indicates the expected number of treated variables between vectors, when \({\mathcal {X}}\) variables are uniformly distributed (notated U). Non-uniform distributions and continuous treatments are considered in Appendix 2. Due to the \(\{\texttt {-}1,\texttt {+}1\}\) sign convention, the product in the standardized covariate space X leads to the relation

$$\begin{aligned} E_{U} [x_i \cdot \lnot x_j] = {\left\{ \begin{array}{ll} -\langle {x}_i, {x}_j \rangle ,&{} \text {if } \langle {x}_i,{x}_j \rangle \le 0\\ 0, &{} \text {otherwise} \end{array}\right. }. \end{aligned}$$
(10)

When \(\langle {x}_i, {x}_j \rangle =-1\), the probability of drawing a treated variable when comparing i and j is 1, i.e., \(x_i(a) \cdot \lnot x_j(a) {=}+1\) for all \(a \in {\mathcal {X}}\). We also associate the pair with a treatment size, \(\phi ^{cx}_{ij}\), and a possible confounding risk in relation to the remaining \(n{-}2\) individuals, \(\phi ^{bl}_{ij}\). We consider these in a Bayesian framework next.
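As an illustration, the sketch below computes the treatment likelihood implied by Eq. (10) for two hypothetical pairs; the dot-product is normalized here so that it lies in \([-1,+1]\), which is an assumption made for the example rather than the constant used in Eq. (8):

```python
import numpy as np

# Sketch (hypothetical vectors): the treatment likelihood of a pair per Eq. (10),
# i.e., the rectified, sign-flipped normalized dot-product in the {-1, +1} encoding.

def treatment_likelihood(x_i, x_j):
    dot = np.dot(x_i, x_j) / len(x_i)   # normalized dot-product, in [-1, 1]
    return max(0.0, -dot)               # Eq. (10): only negative dot-products count

print(treatment_likelihood(np.array([+1, +1, +1]), np.array([-1, -1, -1])))  # 1.0: every variable treated
print(treatment_likelihood(np.array([+1, -1, +1]), np.array([+1, -1, +1])))  # 0.0: no treatment (identical pair)
```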

Sample Balance and Optimization

We now turn to the conditions \(\phi _{ij}\). We represent these conditions geometrically while relating them to the density \(p_y({x}_i,{x}_j)\). Equation (1) implies the following likelihood over expected effects:Footnote 6

$$\begin{aligned} \begin{aligned} \prod _{i,j} {\mathcal {N}}[ y_{ij} \; \mid \; p_y({x}_i,{x}_j) f({x}_i{\ominus }{x}_j), h({x}_j {\ominus } {x}_i) ]. \end{aligned} \end{aligned}$$
(11)

To also allow learning from pairs with balanced treatments, we introduce a Gaussian prior \({\mathcal {N}}({x}_i{\ominus }{x}_j \; \mid 0, \phi ^{bl}_{ij})\) for the probability \(p_y({x}_i,{x}_j)\), where \(\phi ^{bl}_{ij}\) is a strictly positive scalar for each pair. If the pair is not a factorial treatment, the probability that it represents the treatment \({x}_i{\ominus }{x}_j\) depends on the likelihood that the pair’s non-common factors (a 2-sample) are balanced in the remainder of the sample.

Combining the likelihood in Eq. (11) with the Gaussian prior, we obtain

$$\begin{aligned} \begin{aligned} \prod _{i,j} {\mathcal {N}}[ y_{ij} \; \mid \; p_y({x}_i,{x}_j) f({x}_i{\ominus }{x}_j), \phi ^{cx}_{ij} ]\times {\mathcal {N}}[ {x}_i{\ominus }{x}_j \; \mid \; 0, \phi ^{bl}_{ij} ], \end{aligned} \end{aligned}$$
(12)

where \(\phi ^{cx}_{ij} = h({x}_i{\ominus }{x}_j)\) is the pair’s variance. This formulates Bayesian priors for conditions \(\phi _{ij}\) from Eq. (1) in a way similar to a Tikhonov regularization  [58].

Considering a single position \({x}_i\) and treatment v, our goal is to transform \({x}_i\) such that \(\vert {\mathbf {x}}_i - {\mathbf {x}}_j \vert ^2 = p_y({x}_i,{x}_j)f({x}_i{\ominus }{x}_j)\) for \(0 < j \le n\). Combining likelihoods in Eqs. (12) and (10), taking logarithms and dropping constants we arrive at the objective

$$\begin{aligned} \begin{aligned} \Gamma ({x}_i)&= \min _{{x}_i} \sum _{j=1}^n (1+{\hat{\phi }}_{ij}^{cx})[ \langle {x}_i, {x}_j \rangle + y_{ij}]^2 + {\hat{\phi }}_{ij}^{bl} \langle {x}_i, {x}_j \rangle ^2 +b_i,\\ {\hat{\phi }}_{ij}^{cx}&= \frac{\vert {x}_i + {x}_j \vert }{m}, \\ {\hat{\phi }}_{ij}^{bl}&= \vert \frac{1}{n}\sum _{k=0}^n \langle {x}_k, {x}_i+{x}_j \rangle \vert ,\\ y_{ij}&= \max (0, y_i - y_j). \end{aligned} \end{aligned}$$
(13)

We consider the overall objective function first, then the pairwise penalty estimates, notated \({\hat{\phi }}_{ij}\), followed by the intercept \(b_i\) and the outcome differences \(y_{ij}\). Minimizing Eq. (13) with respect to the m-sized vector \({x}_i\) yields a maximum a posteriori estimate of the individual’s position. The objective function argument is an individual’s position \({x}_i\) (a rowspace vector) and not factor positions (column space vectors). More specifically, Eq. (13) leads to an iterative gradient-minimization procedure for each individual, \({\mathbf {x}}^{t+1}_i = {\mathbf {x}}^t_i -\eta \nabla \Gamma ({\mathbf {x}}^t_i)\), with learning rate \(\eta\) and \({\mathbf {x}}_i^{t=0} = {x}_i\). Considering the entire sample population, we iteratively minimize the sum of these gradients, \(\sum _{i} \nabla \Gamma ({\mathbf {x}}^t_i)\). Both Statistics and Machine Learning researchers have considered the problem of minimizing an objective function in the form of a sum of gradients. We use Stochastic Gradient Descent (SGD)  [59], which samples a subset of the summand functions at every step and has found widespread use in Machine Learning  [60, 61]. The scheme allows us to consider billions of observation pairs when estimating effects. We discuss other implementation details in “Implementation”. The resulting optimization transforms the original space X into \(T_y(X)\): it transforms treatment vector differences until they reflect outcome differences that approximate, according to the defined costs, those that would be observed in factorial experiments.

Equation (13) defines \(y_{ij}\) as nonnegative outcome differences. The scalar term \(b_i\) is an individual intercept, with zero expected mean, that is also minimized. The terms \({\hat{\phi }}^{bl}_{ij}\) and \({\hat{\phi }}^{cx}_{ij}\) reflect difference-of-means balance and treatment-size conditions for the pair of individuals i and j. Pairs with both penalties equal to zero (balanced, univariate treatments) reproduce, according to the previous assumptions, factorial or randomized treatments. In this case, \(\langle {x}_i, {x}_j \rangle\) is made to reflect \(y_{ij}\).

Calculations run over thousands of iterations for large observation matrices X. It is therefore important to define simple penalties \(\phi _{ij}\). We defined treatments by dividing a pair’s variables into treated and common subsets. With the \([\texttt {-}1,\texttt {+}1]\) sign convention, the vector \({x}_i + {x}_j\) has non-zero values for non-treated variables. In Eq. (13), the penalty \({\hat{\phi }}^{cx}_{ij}\) is therefore a normalized estimate of the number of non-treated variables. Non-treated (i.e., non-zero) coordinates in \(x_i+x_j\) can confound outcome effect observations, \(y_{ij}\), Fig. 7c. We did not deem pairs under these conditions necessarily unsuitable for estimation. Instead, we considered that the pair has coordinates that need to be balanced among the individuals k that do not belong to the pair, \(k\ne i,j\). For an out-of-pair individual k, \(\langle {x}_k, {x}_i{+}{x}_j \rangle\) is the projection of that individual’s vector \({x}_k\) onto \({x}_i{+}{x}_j\). The penalty \({\hat{\phi }}^{bl}_{ij}\) is a sum of such projections from all other \(n-2\) individuals (signed, due to the same convention). Orthogonal vectors have null projections, and balanced vector sets have null sums.
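A minimal sketch of a stochastic gradient step on Eq. (13) is given below. It is not the article’s optimized C++ implementation: the dot-products are left unnormalized, the intercept \(b_i\) is omitted, the penalty estimates are treated as constants within a step, and the batch size is arbitrary; only the learning rate (\(\eta =0.025\)) follows the setting reported in “Implementation”.

```python
import numpy as np

# Sketch of one SGD step on the pairwise objective of Eq. (13), under the
# simplifying assumptions stated above.

def pair_penalties(T, i, j):
    m = T.shape[1]
    s = T[i] + T[j]                        # non-treated coordinates are non-zero here
    phi_cx = np.linalg.norm(s) / m         # treatment-size penalty
    phi_bl = abs(np.mean(T @ s))           # balance penalty: mean projection of all rows onto s
    return phi_cx, phi_bl

def sgd_step(T, y, eta=0.025, batch=256, rng=None):
    """One SGD pass over a random batch of ordered pairs (i, j)."""
    rng = rng or np.random.default_rng()
    n = T.shape[0]
    G = np.zeros_like(T)
    for _ in range(batch):
        i, j = rng.integers(n, size=2)
        if i == j:
            continue
        y_ij = max(0.0, y[i] - y[j])       # nonnegative outcome difference
        dot = T[i] @ T[j]
        phi_cx, phi_bl = pair_penalties(T, i, j)
        # gradient of (1 + phi_cx)(dot + y_ij)^2 + phi_bl * dot^2 w.r.t. T[i]
        G[i] += 2 * (1 + phi_cx) * (dot + y_ij) * T[j] + 2 * phi_bl * dot * T[j]
    return T - eta * G

# Tiny synthetic example: 50 individuals, 6 covariates in [-1, +1], arbitrary outcomes.
rng = np.random.default_rng(1)
T = rng.uniform(-1, 1, size=(50, 6))
y = rng.normal(size=50)
for _ in range(100):                       # the article reports 10,000 iterations
    T = sgd_step(T, y, rng=rng)
print(T.shape)
```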

Appendix 2: Supporting Material

Continuous Treatments

We can also use the previous method with continuous variables, when it is assumed that there is uncertainty over the intensity of treatments. This can be carried out either by extending \(p_y(x_i,x_j)\) directly or by considering a third Bayesian factor for treatment intensity in Eq. (11) (together with treatment balance and size). We consider the former. In Computational Learning Theory, a product distribution  [62, 63] is a distribution over \(\{0,1\}^m\) which generalizes the relationship in Eq. (8) to the non-uniform case. We define a distribution

$$\begin{aligned} \begin{aligned} {\mathcal {D}}_{ij}&= \prod _{p_i(\texttt {+}a)\le p_i(\texttt {-}a)}p_j(\texttt {+}a)\prod _{p_i(\texttt {+}a)> p_i(\texttt {-}a)}p_j(\texttt {-}a),\\ \end{aligned} \end{aligned}$$
(14)

where \(p_i(\texttt {+}a)\) is the probability that individual i has factor a and \(p_i(\texttt {-}a)\) the probability that they do not, \(a \in {\mathcal {X}}\). The first therefore indicates certainty of a positive treatment status and the second of a negative one. Any continuous value in between corresponds to individuals with uncertain treatment statuses.

This generalization preserves the relationship in Eq. (10), where \(x_i\) becomes the observational random vector with \({x_i}(a) = p_i(\texttt {+}a) - p_i(\texttt {-}a)\). In this case, the dot-product reflects the expectation over \({\mathcal {D}}_{ij}\) instead of U  [62]. With a single observation per individual, a simple way of obtaining these vectors is to unity-base normalize (i.e., feature scale) the observation matrix X and assume any applicable prior for the in-between values in \([\texttt {-}1,\texttt {+}1]\). This makes the maximum and minimum correspond to treated and nontreated statuses, with intermediary treatments having, for example, exponentially decreasing intensities.

Implementation

The method can be carried out for all individuals in parallel with matrix operations. For results in this article, we first unity-base normalize X,

$$\begin{aligned} T^{0}(X) = 2(X - X_{\min }) \oslash (X_{\max }-X_{\min }) - 1.0, \end{aligned}$$
(15)

where \(X_{\min }\) and \(X_{\max }\) are \(m \times n\) matrices with the per-column minimum and maximum values of X, and \(\oslash\) is the element-wise (Schur) division.
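A sketch of Eq. (15) with NumPy follows; the guard for constant columns is our addition, not part of the equation:

```python
import numpy as np

# Sketch of the unity-base normalization in Eq. (15): per-column feature scaling
# of the observation matrix X into T0 with values in [-1, +1].

def unity_base_normalize(X):
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0, keepdims=True)
    x_max = X.max(axis=0, keepdims=True)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard against constant columns (our assumption)
    return 2 * (X - x_min) / span - 1.0

X = np.array([[0, 10, 3],
              [1, 30, 3],
              [0, 20, 9]])
print(unity_base_normalize(X))   # each column now spans [-1, +1]
```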

Algorithm 1

Equation (15) calculates the initial space \(T^0(X)\). Subsequent transformations are performed by gradient descent over the sample population. The resulting method is summarized in Algorithm 1. All results in this article use 10,000 iterations and a learning rate of \(\eta =0.025\). An optimized C++ version estimates a space \(T_y(X)\) for the NSW dataset in under 5 minutes on a MacBook laptop.

NSW Study Details

Table 1: The 20 top variables according to \(T_y(X)\) and LASSO regression in the complete NSW, ordered by estimated effect or correlation

The NSW was a 1970s subsidized work program, running in 15 cities across the US for 4 years. It targeted individuals with longstanding employment problems: ex-offenders, former drug addicts, recipients of welfare benefits, and school dropouts. At the time of enrollment, each NSW participant was given a retrospective baseline interview, generally covering the previous two years, followed by up to four follow-up interviews scheduled at nine-month intervals. Survey questions covered demographic and behavioral topics such as age, sex, race, marital status, education, number of children, employment history, job search, job training, mobility, housing, household, welfare assistance, military discharge status, drug use and extralegal activities. Most questions were objective and probed for specific information loosely around the previous themes (e.g., ‘what kind of school are you going to? 1 = high school, 2 = vocational, 3 = college, 99 = other’, ‘was heroin used in the last 30 days?’, etc.). Some questions were subjective (e.g., ‘tell me how important each one is to you. knowing the right people, education, luck, hard work, ...’).

To assemble control surrogates for the NSW, Lalonde used the Panel Study of Income Dynamics (PSID), a household survey, and Westat’s matched Current Population Survey-Social Security Administration file (CPS). He drew three subsamples from each of the PSID and CPS (six in total). The control groups had 450, 550, 726, 2666, 2787 and 16289 individuals.Footnote 7 Lalonde’s ex ante assumptions for the NSW, PSID and CPS regarded mainly participants’ assignment date, gender, retirement status, age and prior wages. DW added further assumptions regarding prior wages for the NSW and used Lalonde’s control groups.

For the ‘missing causes’ study, we first estimated a model \(T_y(X)\) where

$$\begin{aligned} {\mathcal {X}}=\{all\, 1231\, variables\, in\, the\, NSW\, dataset\}. \end{aligned}$$
(16)

We use the same outcome variable y as Lalonde, post-program annual earnings (in 1982 dollars). While using all NSW variables (i.e., the answer to every survey question), we restrict them in only one way, and the restriction does not reduce the participant or variable counts. We ignore any variable values that are negative or ‘99’, taking them as omitted; these values are then mapped to \(x(a) = 0\) according to Eq. (15). They correspond to unknown, unanswered or exceptional values in the survey. We assume SFE should be able to handle other types of entry. Most variables are binary and naturally normalized to \([\texttt {-}1,\texttt {+}1]\) according to Eq. (15). Other variables are coded to reflect a spectrum (e.g., ‘even though the 1000 could result in arrest, how likely is that you would take the chance? 1 = very likely, 2 = somewhat likely, 3 = somewhat unlikely, 4 = not likely at all’) and are accordingly mapped to \([\texttt {-}1,\texttt {+}1]\). Continuous and count variables are similarly linearly normalized to fit the interval (with maxima mapped to +1 and minima to -1). Following Lalonde’s protocol, we ‘annualized’ the data. Participants’ assignment date and location are not in the NSW data (only the participant’s relative time in the program). Lalonde recovered site locations and assignment years by matching reported sites’ unemployment to unemployment figures in Earnings and Employment magazines; this is described in detail in  [24]. Annualization allowed Lalonde to select only the 1975 participants. We, instead, added participants’ estimated year of assignment and program site location as extra variables.
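The following sketch illustrates this coding scheme on a hypothetical survey column (not actual NSW data): negative and ‘99’ entries are treated as omitted and mapped to 0, and the remaining values are linearly scaled to \([\texttt {-}1,\texttt {+}1]\):

```python
import numpy as np

# Sketch (hypothetical survey column): code a raw survey variable into [-1, +1],
# treating negative and '99' entries as omitted (0), then linearly scaling the
# remaining range so that the maximum maps to +1 and the minimum to -1.

def code_survey_column(col):
    col = np.asarray(col, dtype=float)
    omitted = (col < 0) | (col == 99)
    valid = col[~omitted]
    lo, hi = valid.min(), valid.max()
    scaled = np.zeros_like(col)                      # omitted values stay at 0
    scaled[~omitted] = 2 * (valid - lo) / (hi - lo) - 1 if hi > lo else 0.0
    return scaled

# e.g., '1 = very likely ... 4 = not likely at all', with 99 = no answer
print(code_survey_column([1, 2, 4, 99, 3, -1]))      # -> [-1, -0.33, 1, 0, 0.33, 0]
```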

Table 1 (first row) lists the 20 variables with the largest ATE(a) in the NSW, ordered by effect size. We also show the output of a LASSO estimator, as a more typical model selection procedure, in Table 1 (second row). The NSW treatment indicator appears as the third most influential variable, but it does not appear in the list of variables selected by LASSO (Table 1). Effect sizes relate to norms in \(T_y(X)\) and dependence among variables to angles.Footnote 8 To compare an SFE-devised DGP with Lalonde’s, we next select a variable set of the same size as the one used by Lalonde. We choose effective and non-redundant causes. Let M then be a set of unrelated variables, \(M \subset {\mathcal {X}}\), where \({\mathcal {X}}\) is the previous set of 20 effective variables. Furthermore, let \(M^{k=K}({\mathcal {X}})\) be a subset of \({\mathcal {X}}\) with K variables and

$$\begin{aligned} M^k(X) = M^{k-1}(X) \cup \mathop {\text {arg}\,\text {min}}\limits _{\begin{array}{c} b\in {\mathcal {X}}-M^{k-1},\\ a \in M^{k-1} \end{array}} \cos ^2(a,b), \end{aligned}$$
(17)

where \(M^0(X)=\{a\}\) and a is the variable with the highest ATE(a). We use Eq. (17) with \(0 < K \le 7\), which selects the bolded variables in Table 1 (first row). We use \(K=7\) to match Lalonde’s model size. This is a simple variable selection method: it uses only the estimated causal effects and the expected dependence among variables. Across-population heterogeneity and other factors readily available in the representation could play roles in more sophisticated criteria.
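A sketch of this greedy selection follows; it reads the argmin in Eq. (17) as a joint minimum over candidate/selected pairs, and the inputs (variable vectors in \(T_y(X)\) and their estimated effects) are synthetic stand-ins:

```python
import numpy as np

# Sketch of the greedy selection in Eq. (17): start from the variable with the
# largest estimated effect, then repeatedly add the candidate whose squared cosine
# with the already-selected variables is smallest.

def cos2(u, v):
    return (u @ v) ** 2 / ((u @ u) * (v @ v))

def greedy_select(V, ate, K):
    m = V.shape[1]
    selected = [int(np.argmax(ate))]                  # M^0: largest ATE
    while len(selected) < K:
        candidates = [b for b in range(m) if b not in selected]
        best = min(candidates,
                   key=lambda b: min(cos2(V[:, a], V[:, b]) for a in selected))
        selected.append(best)
    return selected

rng = np.random.default_rng(2)
V = rng.normal(size=(100, 20))                        # 20 candidate variable vectors
ate = np.abs(rng.normal(size=20))                     # stand-in effect estimates
print(greedy_select(V, ate, K=7))
```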

The method leads to the following selected variables (ordered by the greedy selection),

$$\begin{aligned} {\mathcal {X}}=\{ work\_ethics, african\_american, school, worked, drink, nyc\}. \end{aligned}$$
(18)

The work_ethics variable is related to the survey question ‘I’ll read a list of things some people feel are important in getting ahead in life. Tell me how important each one is to you?’ The answer follows the scale {important, unknown, not important} and the variable corresponds to the item ‘hard work’. Other items were ‘luck’, ‘education’, ‘knowing the right people’ and ‘knowing the community’ (the item ‘education’ also appeared as a top-20 variable, Table 1). The african_american variable indicates the participant’s race, similar to a variable selected by Lalonde. The school variable indicates whether the participant was in school within the last 6 months. The worked variable indicates whether the participant worked less than 40 h in the previous 4 weeks (prior to assignment). The drink variable indicates the participant’s answer to ‘do you ever drink beer, wine, gin or other hard liquor?’ The nyc variable indicates that the participant’s NSW site location was New York City. Another location (Philadelphia) and an assignment year (1976) also appear in the top-20 list.

Similar to Lalonde, we established a correspondence between the NSW and the PSID for these variables. We ignored nyc as the PSID has no public location information. We mapped hardwork to the ‘earning acts’ PSID variable (V2941). It is an aggregate of indicators: ‘[Family] head seldom or never late for work, head rarely or never fails to go to work when not sick, head has extra jobs, head likes to do difficult or challenging things, etc.’ And we mapped drink to PSID’s annual expenditures on alcoholic beverages variable (V2472) divided by income.

The estimates for this alternative model specification across the previous two samples are depicted in Fig. 5 (main text). These results confirm some of the reasons Heckman et al.  [17, 56] put forward to explain the poor performance of matching estimators in Lalonde’s NSW subsample: ‘locations in different labor markets’ appear as an effective factor, and expanding the ‘limited selected observed variables’ can improve the methods’ accuracy. The results suggest that issues like these can, however, be overcome by observational methods by considering missing causes, non-causes, and how to identify them. They also suggest that SFE can be used both to estimate causal effects and to help with model specification for other estimators.

Analysis: Bias-Variance Tradeoff

We now motivate the choice to model pairwise individual differences with an alternative analytic argument. Consider an outcome difference predictor \({\hat{y}}_{ij}\) for individual i (i.e., for i’s outcome differences from others). Most effect estimators consider the least-biased estimate for a population. We consider, instead, what would be the least-biased estimate for an individual. Repeated samples of \(\{x_i, y_i\}\) could increase the estimator’s accuracy but, as assumed in Rubin’s framework  [29], these are rarely available, while observations from other individuals are often abundant. We, therefore, consider the error incurred by i when using an observation from a second individual j. Properties of the following dot-product-based estimator are well known  [48] (p. 50), as is its relationship to the Gram-Schmidt procedure (the same results can also be derived through product distributions  [62]). What distinguishes the following is the formulation of an estimator for outcome differences, \(y_{ij}\), as opposed to outcomes, \(y_i\).

For i and an observational pair (i, j), the observed outcome difference \(y_{ij}\) can only be due to attributes in \(x_i\) not present in \(x_j\). This leads to a ‘counterfactual’ estimator for effects at the individual level. According to the estimator, the observed outcome difference between individuals is due to the effect of attributes that only i has, minus the effect of attributes that only j has: \(y_{ij} \sim f(x_i{\ominus }x_j) - f(x_j{\ominus }x_i)\).

For an individual i, observations from other individuals lead to the effect predictor

$$\begin{aligned} \begin{aligned} {\hat{y}}_{ij} = \frac{1}{\vert x_i-x_j \vert } \sum _{a\in {\mathcal {X}}} f_i(a)\big (x_i(a) - x_j(a)\big ) + \varepsilon _{ij} = \frac{\langle f_i, x_i-x_j \rangle }{\vert x_i-x_j \vert } + \varepsilon _{ij}, \end{aligned} \end{aligned}$$
(19)

where \(f_i \in {\mathbb {R}}^m\) is an individual vector of effects on the y-scale, \(x \in [\texttt {-}1,\texttt {+}1]^m\), \(E(\varepsilon )=0\), and \(\varepsilon\) has finite variance. Due to the sign convention, the estimator sums the effects of variables in \(x_i\) but not in \(x_j\), subtracts the effects of variables in \(x_j\) but not in \(x_i\), and cancels out the effect of variables in both.
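A small sketch of the predictor in Eq. (19), with synthetic stand-ins for \(f_i\), \(x_i\) and \(x_j\):

```python
import numpy as np

# Sketch of the pairwise difference predictor in Eq. (19): the predicted outcome
# difference is the projection of i's effect vector f_i (in y-scale) onto the
# normalized covariate difference x_i - x_j. All inputs below are synthetic.

def y_hat(f_i, x_i, x_j, eps=0.0):
    d = x_i - x_j
    return f_i @ d / np.linalg.norm(d) + eps

f_i = np.array([1.0, 0.5, 0.0, -0.25])
x_i = np.array([+1, +1, -1, +1])
x_j = np.array([-1, +1, -1, +1])     # only the first variable is 'treated'
print(y_hat(f_i, x_i, x_j))          # 1.0: effect of the treated variable alone
```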

The estimator’s squared error loss can be decomposed into three components, corresponding to a heterogeneity bias, a variance, and an irreducible error \(\varepsilon _{ij}\),

$$\begin{aligned} \begin{aligned} Err_{ij}&= E[( y_{ij}- {\hat{y}}_{ij})^2 ],\\&= E\Big [ \Big ( y_{ij} - \frac{\langle f_i, x_i-x_j \rangle }{\vert x_i-x_j \vert } \Big )^2 \Big ],\\&= [Heter_{ij}^2 + Var_{ij}^2 + \varepsilon _{ij}^2 ],\\ Heter_{ij}&= E[{\hat{y}}_{ij}] - y_{ij},\\&=\Big [\frac{1}{2}\Big ( \frac{\langle f_i, x_i-x_j \rangle }{\vert x_i-x_j \vert } - \frac{\langle f_j, x_i-x_j \rangle }{\vert x_i-x_j \vert } \Big )\Big ]^2 - y_{ij},\\&= \Big [\frac{1}{2}\frac{\langle f_i-f_j, x_i - x_j \rangle }{\vert x_i-x_j \vert }\Big ]^2 - y_{ij},\\&= \frac{1}{4}\vert f_i-f_j \vert ^2 \cos ^2(\theta _{ij}) -y_{ij},\\ Var_{ij}&= E[({\hat{y}}_{ij} - E[ {\hat{y}}_{ij}])^2],\\&= \frac{\varepsilon _{ij}^2}{\vert x_i - x_j \vert ^2}. \end{aligned} \end{aligned}$$
(20)

where \(\theta _{ij}\) is the angle between the vectors \(f_i-f_j\) and \(x_i-x_j\). The \(Heter_{ij}\) term is a squared heterogeneity bias, the y-amount by which the estimate differs from the mean obtained using the other individual’s effects. The \(Var_{ij}\) term is the variance, the expected squared deviation around the estimated mean in y-amounts.

According to this, variance and heterogeneity are related to two distances (norms of position differences) between individuals i and j. These are, in turn, related to spaces X and \(T_y(X)\). Variance is related to distances in the \([\texttt {-}1,\texttt {+}1]^m\) covariate space, \(\vert x_i - x_j \vert\), and heterogeneity to distances over effects, \(\vert f_i - f_j \vert\). Larger distances in X correspond to treatments over more variables, decreasing the estimator’s variance in Eq. (19). Larger distances in \(T_y(X)\) correspond to estimates between more heterogeneous individuals, increasing the heterogeneity bias. Decreasing this bias increases the estimate’s external validity (for the individual, not the sample population), while decreasing variance increases its internal validity.

In particular, Eq. (20) suggests that, for a given \(\theta _{ij}\), there are two sources of bias: the difference in variable effects among individuals, \(f_i(a)-f_j(a)\), and the covariate space dimension, m. For the former, an estimate with minimal \(Heter_{ij}\) must have \(y_{ij} = \frac{1}{4}\vert f_i-f_j \vert ^2\). This corresponds to the minimized residual illustrated in Fig. 7c and implemented by Eq. (13). For m, decreasing the space to a dimension \(d < m\) can increase the variance, \(Var_{ij}\), in Eq. (19) but decrease \(Heter_{ij}\). This motivated the introduction of the treatment size penalty \(\phi ^{cx}_{ij}\). Both the decision to use only variables that differ among pairs and the proposed optimization procedure can therefore be seen as attempts to reduce the individual heterogeneity bias, \(Heter_{ij}\).

Heterogeneity also increases with \(\cos ^2\theta _{ij}\), which, in turn, reflects statistical correlation. This indicates that heterogeneity is maximal for individuals with highly correlated variables (e.g., sharing many attributes) that observe different effects. This motivated the introduction of the treatment balance penalty \(\phi ^{bl}_{ij}\), which penalizes non-orthogonal pairs, as well as the variable selection criterion in Eq. (17). Together, these considerations suggest a metric space as the representation for a sample population, consisting of a set of orthogonal dimensions with correlated covariates (\(\cos ^2\theta _{ij} \approx 1\)) that are minimally heterogeneous (\(y_{ij} \approx \frac{1}{4}\vert f_i-f_j \vert ^2\;\)).

Starting from a single individual i and her individual sample \(\{x_i,y_i\}\), we begin with maximal internal validity. As we increase the population scope and consider other individuals’ samples \(\{x_j,y_j\}\), we can increase the estimates’ external validity. Learning a representation for individual differences, \({\hat{y}}_{ij}\), allowed for more accurate (individual) effect estimates while inter-individual effect differences, \(Heter_{ij}\), were minimized explicitly. This led to a space that is ‘minimal’ but that still reflects the observed outcome differences, \(y_{ij}\).

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ribeiro, A.F., Neffke, F. & Hausmann, R. What Can the Millions of Random Treatments in Nonexperimental Data Reveal About Causes?. SN COMPUT. SCI. 3, 421 (2022). https://doi.org/10.1007/s42979-022-01319-2
