1 Inferences from Correlations and Regressions

Many sciences are full of correlations. The first question, when confronted with an observed correlation, is whether it is a random effect of the sampling or whether the observed correlation reflects a true correlation in the entire population. This distinction is very important to keep in mind when discussing possible causal links.

This question, whether there is a correlation in the entire population or not, cannot be answered with complete certainty. However, if the sampling is truly random, one can calculate a confidence interval for the correlation coefficient in the population, conditional on the observed one. The method is thoroughly described in textbooks on statistical inference, and in Appendix C you will find an example of the calculation. In what follows we take for granted that a correlation has been observed in a sample, that one has inferred that a correlation obtains in the entire population, and that this conclusion is correct.
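The book's worked example is in Appendix C; purely as an illustration of the general idea, here is a minimal Python sketch using the standard Fisher z-transformation (the values r = 0.6 and n = 50 are invented for illustration):

```python
import math

def correlation_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a population correlation,
    via the Fisher z-transformation (assumes random sampling from an
    approximately bivariate normal population)."""
    z = math.atanh(r)              # Fisher transform of the sample r
    se = 1.0 / math.sqrt(n - 3)    # approximate standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

# Invented illustration: r = 0.6 observed in a sample of n = 50
print(correlation_ci(0.6, 50))  # roughly (0.39, 0.75)
```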

Why randomisation?

It was Fisher (1925) and Neyman (1923) who first stated that randomisation in sampling is necessary if one wants to make reliable inferences from a sample to the population. The reason is that one needs a probability distribution when performing this inference, and the question is which one to choose.

One may conceive of the actual sample as one of numerous possible ones, and if the sampling is random, the distribution of the mean values of these imagined samples is normal with the same mean as that of the population. Furthermore, Fisher derived an equation relating the standard deviation of the sample, s, to the standard deviation in the population, \(\sigma \). This means that if the sampling is random, one can use a normal distribution for calculating confidence intervals for the mean values of different quantitative attributes. In particular, if we observe a correlation between two variables in a sample, we can calculate a confidence interval for the true correlation in the entire population.
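For reference, the standard textbook relation presumably intended here is that the mean of a random sample of size n is approximately normally distributed around the population mean \(\mu \), with the sample standard deviation s estimating \(\sigma \):

$$\displaystyle \begin{aligned} \bar{X} \sim N\!\left(\mu,\; \frac{\sigma^2}{n}\right), \qquad \bar{x} \pm 1.96\, \frac{s}{\sqrt{n}} \, \, \text{(approximate 95\% confidence interval for } \mu \text{)} \end{aligned} $$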

Suppose we have done that and found a substantial correlation between two variables in the entire population. The next question is: what possible mechanisms can produce a correlation between two attributes of objects in an entire population?

In cases where the correlation is astonishing, given our background knowledge of nature and society, many people are inclined to conclude that the correlation must be a random effect. This is certainly possible, in particular if the correlation is observed in a small sample. But remember: in the following discussion about correlation and causation, the point of departure is that the inference from the correlation in the sample to the correlation in the entire population is correct.

Now, a correlation in an entire population consisting of an unlimited number of individuals cannot be due to randomness. This is a consequence of the strong law of large numbers, which is a theorem in statistics. It says, roughly, that if one randomly chooses a number of items from a population, the observed mean value of a stochastic variable in that sample will converge to the mean value of that variable in the population as the number of items in the sample increases. So if we have a series of increasingly large samples in each of which we observe a correlation between two variables, the observed correlations will approach the correlation in the entire population. So a correlation due to randomness may arise in a limited sample, through the procedure of selecting items for the sample, but not in the entire population. If there is a correlation in the entire population, we can reject the suggestion that it is a random effect. The question is then: how could there be a correlation in an entire, perhaps infinite, population?
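This convergence is easy to see in a minimal simulation sketch (the population correlation of 0.5 and the sample sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_corr(n, rho=0.5):
    """Draw n pairs from a bivariate normal with true correlation rho
    and return the sample correlation coefficient."""
    cov = [[1.0, rho], [rho, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    return np.corrcoef(x, y)[0, 1]

# As n grows, the sample correlation converges on the population value 0.5
for n in (10, 100, 10_000, 1_000_000):
    print(n, round(sample_corr(n), 3))
```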

The received view is, and we have no arguments to the contrary, that there are three possible ways for a correlation between two variables X and Y to occur in a population:

  (i) X is one (direct or indirect) cause of Y;

  (ii) Y is one (direct or indirect) cause of X;

  (iii) X and Y have a common cause.

This is called Reichenbach’s principle, after Hans Reichenbach (1891–1953) who first formulated it. It is easy to understand why these three types of mechanisms will produce a correlation. Could there be other mechanisms still? Not as far as we know.

How, then, do we decide which alternative is the case?

One can sometimes exclude alternative (i) or (ii), if one knows the timing of individual instances. If for example individual instances of X always occur in time before corresponding individual instances of Y, then one can exclude alternative (ii), and vice versa.

In some cases one can infer from a well-established theory which alternative is the case. However, this is rarely possible in SES research; the field is still in its infancy and there are few if any well-established theories. Nevertheless one may sometimes guess that alternative (iii) ought to be the case, because any causal link between X and Y seems utterly implausible, given general scientific knowledge.

One example, although not from SES, is the strong correlation, 0.72, between the prevalence of cousin marriage and the percentage of wealth held in cash, as measured across Italy's 107 provinces, see Fig. 7.1. There is no reason to believe that there is a direct causal link between these two features of Italian people's behaviour, so Henrich (2020) assumed that there must be a common cause, namely, people's degree of trust in strangers. People with low trust in unknown and non-related persons are not inclined to invest their money in stocks or put it in banks. Similarly, in communities with low trust in persons outside the extended family, marriages between unrelated persons are not popular, and therefore cousin marriage is more prevalent. Conversely, people with high trust in other persons and institutions are more inclined to put their excess money into productive investments and are less sceptical of marriages with non-relatives. This is a plausible and testable hypothesis. Moreover, other studies have shown the same geographical variation in general trust over Italy's provinces. Roughly, the degree of trust is higher in northern than in southern provinces, whereas the proportion of wealth in cash and of cousin marriage is lower the further north an Italian province is situated.

Fig. 7.1
(Scatterplot: percentage of wealth in cash versus cousin marriages per 100,000, with a fitted regression line; \(R^2 = 0.6\).)

The correlation between cousin marriage and percentage of wealth in cash in Italy’s 107 provinces. Adapted from Blair Fix: Weird Consilience: A Review of Joseph Henrich’s ‘The WEIRDest People in the World’, https://economicsfromthetopdown.com/2022/05/20/

Information about correlations is in itself seldom of any particular interest. Such information is a means to an end, the end of obtaining information about causal relations. To a great extent this interest is driven by our desire to act in the world: we try to prevent unpleasant future events, if possible, or we try to increase the chance of desirable ones. In order to attain such goals, we need causal information: what should we do in order to bring about, or increase the chance of, a certain effect? So we are looking for information about causal links, and that search is driven by our interests as agents in the world.

It follows that the concept of cause is strongly related to the concepts of intervention and manipulability (cf. Sect. 6.7). We may cause a future event to occur, or at least increase its probability of occurring, by doing something now. Or a present action may prevent a possible future event, i.e., cause it not to happen.

It follows immediately that if we have a correlation between two variables X and Y and wonder whether X is a cause of Y (or vice versa, if Y-events precede corresponding X-events), we should manipulate X, i.e., make interventions, for example by intentionally changing the values of the variable X and seeing whether the values of Y change concomitantly. This requires an experimental design.
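As a toy illustration of why intervention matters, the following sketch (all numbers invented) builds a case where a hidden common cause Z makes X and Y correlate even though X has no effect on Y; setting X ourselves makes the correlation vanish:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Invented toy model: Z causes both X and Y; X has no effect on Y.
z = rng.normal(size=n)
x_obs = z + rng.normal(size=n)   # X as passively observed
y = z + rng.normal(size=n)       # Y is driven by Z only

# Intervention: we set X ourselves, overriding its dependence on Z.
x_do = rng.normal(size=n)

print(np.corrcoef(x_obs, y)[0, 1])  # about 0.5: a spurious correlation
print(np.corrcoef(x_do, y)[0, 1])   # about 0: the intervention breaks it
```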

Experimental tests are agreed to be the gold standard for testing hypotheses about causal relations. This nearly universal agreement about the optimal way of testing causal claims is no mere coincidence; it is a consequence of a core aspect of the meaning of expressions of the type ‘…is a cause of….’.

But what to do when experiments are impossible? This is very often the case in e.g., the social sciences such as economics and political science.

Correlations between economic variables are often observed and one may wonder which of all these correlations reflect causal links. It is often difficult, or impossible, to perform controlled experiments, in both macro- and microeconomics. There are at present two suggestions for obtaining the needed information without performing carefully designed experiments: (i) observing natural experiments and (ii) controlling for covariates.

2 Natural Experiments

A natural experiment is not a consciously designed experiment, but a situation that in relevant aspects is similar to an experiment involving a test group and a control group. Here are two examples.

Example 7.1

Angrist and Pischke (2010, 13) discussed how to check the causal effect of class size on average test scores in primary and secondary school. Does the size of the class, i.e. the number of pupils in a school class, have any causal effect on the average score among the pupils? Common sense has it that smaller classes lead to better scores, but in data from the US and many other countries there is no correlation between class size and score; sometimes scores are even better in bigger classes. One cannot easily perform experiments, but the problem can be studied without conscious interventions, as is illustrated by the following case.

In Israel class size is capped at 40, so if there are 41 students, they are divided into two classes of circa 20 students each. (Similarly, if there are 81 students, the group is divided into three classes, and so on.) One can then compare rather small classes with classes of around 40. Since the enrolment numbers at a particular school can be thought of as random, one has a situation sufficiently similar to a real experiment in which schools are randomly divided into those with small classes and those with much bigger ones. In such circumstances one may assume that schools with different numbers of students per class are quite similar in other characteristics; hence if there is any difference in average scores, one may conclude that it is caused by the difference in class size. And in fact there was a clear difference. Angrist and Pischke concluded: ‘Regression discontinuity estimates using Israeli data show a marked increase in achievement when class size falls.’ (op. cit. p. 14)
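The rule driving this natural experiment, often called Maimonides’ rule, can be sketched as follows; the function below is a simplified reconstruction for illustration, not Angrist and Pischke’s actual specification:

```python
import math

def predicted_class_size(enrolment, cap=40):
    """Average class size when a cohort is split into the fewest
    classes that respect the cap (simplified Maimonides' rule)."""
    n_classes = math.ceil(enrolment / cap)
    return enrolment / n_classes

# The discontinuity at the cap: one extra pupil halves the class size.
for e in (38, 39, 40, 41, 42, 80, 81):
    print(e, round(predicted_class_size(e), 1))
# 40 -> 40.0, but 41 -> 20.5; 80 -> 40.0, but 81 -> 27.0
```

Schools just below and just above the cutoff should be similar in other respects, which is what makes the comparison experiment-like.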

Example 7.2

The effect of informing US taxpayers that they had been fined for not carrying health insurance.

Here is a quote from Sarah Kliff in the New York Times (Dec. 10, 2019, updated Dec. 13, 2019):

Three years ago, 3.9 million Americans received a plain-looking envelope from the Internal Revenue Service. Inside was a letter stating that they had recently paid a fine for not carrying health insurance and suggesting possible ways to enrol in coverage.

New research concludes that the bureaucratic mailing saved lives.

Three Treasury Department economists have published a working paper finding that these notices increased health insurance sign-ups. Obtaining insurance, they say, reduced premature deaths by an amount that exceeded any of their expectations. Americans between 45 and 64 benefited the most: For every 1648 who received a letter, one fewer death occurred than among those who hadn’t received a letter. In all, the researchers estimated that the letters may have wound up saving 700 lives.

The experiment, an unintended result of a budget shortfall, is the first rigorous experiment to find that health coverage leads to fewer deaths, a claim that politicians and economists have fiercely debated in recent years as they assess the effects of the Affordable Care Act’s coverage expansion. The results also provide belated vindication for the much-despised individual mandate that was part of Obamacare until December 2017, when Congress did away with the fine for people who don’t carry health insurance. \(\ldots \)

The budget shortfall mentioned was President Trump’s decision to reduce the budget for the IRS. It had the consequence that the IRS stopped sending letters to those who had been fined for not carrying health insurance, so 600,000 uninsured individuals did not get any such letter. That enabled a comparison between sending and not sending such a letter, and it provided strong evidence for the conclusion that sending the letter caused a decrease in the death rate.

3 Controlling for Covariates

Can one find out about causal relations without performing experiments and without access to information about natural experiments? Well, one can do one thing, namely, control for covariates.

The idea is that if the variable Z is a common cause of the variables X and Y, we will observe that the correlation between X and Y disappears when we conditionalise on Z, which is feasible for both quantitative and categorical variables.

This is due to the fact that if X and Y are correlated (i.e., the coefficient of correlation differs from zero), the joint probability P(XY) cannot be factorised. This means that either

$$\displaystyle \begin{aligned} P(XY) >P(X)P(Y) \, \, \text{(positive correlation)} \end{aligned} $$
(7.1)

or

$$\displaystyle \begin{aligned} P(XY) <P(X)P(Y) \, \, \text{(negative correlation)} \end{aligned} $$
(7.2)

while if

$$\displaystyle \begin{aligned} P(XY) = P(X)P(Y) \end{aligned} $$
(7.3)

there is no correlation between X and Y.

If we have available values of a variable Z and conditionalise on it, we might find that

$$\displaystyle \begin{aligned} P(XY|Z) =P(X|Z)P(Y|Z) \end{aligned} $$
(7.4)

i.e., that the joint probability for X and Y when conditionalised on Z is factorisable. (In practice we will rarely find an exact equality. If the product \(P(X|Z)P(Y|Z)\) is close to \(P(XY|Z)\), the researcher may draw the conclusion that he has found the common cause.) If so, the variables \(X|Z\) and \(Y|Z\) are not correlated, which supports the conclusion that Z is the common cause of X and Y.
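A minimal simulation sketch of this screening-off test, with an invented binary common cause Z (all numbers are for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Invented toy model: binary common cause Z drives both X and Y.
z = rng.integers(0, 2, size=n)
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)

print(np.corrcoef(x, y)[0, 1])  # clearly non-zero overall (about 0.5)

# Conditionalise on Z: within each stratum the correlation vanishes.
for value in (0, 1):
    mask = z == value
    print(value, np.corrcoef(x[mask], y[mask])[0, 1])  # near 0
```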

But what if \(P(XY|Z)\) is not factorisable? This indicates that Z was not the common cause, or not the only one; there might be more than one common cause. Obviously, if there are several common causes and we control for only one of them, conditionalising on that one will not result in factorisability.

The difficulties in controlling for covariates are discussed in many papers. One useful contribution is (Witte and Didelez, 2019), which contains links to further literature on the subject. The abstract begins:

When causal effects are to be estimated from observational data, we have to adjust for confounding. A central aim of covariate selection for causal inference is therefore to determine a set that is sufficient for confounding adjustment, but other aims such as efficiency or robustness can be important as well. In this paper, we review six general approaches to covariate selection that differ in the targeted type of adjustment set. We discuss and illustrate their advantages and disadvantages using causal diagrams.

The difficult question is of course how to discover all common causes when experiments are not possible. We can sometimes use well-established theory, which gives us information about causal mechanisms. But this is no certain method, for how often can we be reasonably certain that our theory is complete in the relevant respects? In fact, if we were thus certain, we would not need any statistical analysis to determine whether a correlation indicates a causal link or not. Pearl (2000, 43) summarises our epistemological situation succinctly:

In fact, the statistical and philosophical literature has adamantly warned analysts that, unless one knows in advance all causally relevant factors or unless one can carefully manipulate some variables, no genuine causal inferences are possible (Fisher, 1951; Skyrms, 1980; Cliff, 1983; Eells and Sober, 1983; Holland, 1986; Gärdenfors, 1988; Cartwright, 1989).

Suppose a researcher has discovered a correlation between two variables and has conditionalised on all factors that, according to background scientific knowledge, could possibly be linked to the two correlated variables. Let us further suppose that the correlation has survived this conditionalisation; does that prove that there is a causal link between the correlated variables? No. Our scientific background knowledge could be incomplete; there could be unknown common causes. There is no method for excluding this possibility. For if we had such a method, we could know whether our present best theory in a particular domain is complete or not, and we think that is in principle impossible.

Controlling for covariates can at most show that a correlation is not the result of a causal relation; a positive proof of a causal relation is not possible. A thorough discussion of covariates and causal inference can be found in (Waernbaum, 2008) and references therein.

4 Regression Analysis

Regression analysis is common and is often interpreted as giving information about the strength of causal relations. A linear regression of the form \(Y=a+bX\), where a and b are constants, is often interpreted as telling us that X is a cause of Y and that b is a measure of the strength of the causal coupling. (Often the squared coefficient of correlation \(r^2\) or the squared regression coefficient \(b^2\) is used as a measure of the connection.) Thus Y is often called the response variable and X the explanatory variable. But as already pointed out, this cannot be inferred from the mere equation. It is obvious that the equation can be rewritten so that X is a function of Y. Hence, the distinction between explanatory variable (the cause) and response variable (the effect) must be based on some information not represented in the equation.

This is quite obvious from the fact that the correlation coefficient \(r_{xy}\) and regression coefficient b are related as

$$\displaystyle \begin{aligned} b= r_{xy} \frac{s_y}{s_x} \end{aligned} $$
(7.5)

where \(s_x\) and \(s_y\) are the standard deviations of X and Y respectively. Since a correlation does not in itself tell us anything about a cause-effect relation, neither can information about a regression. Regression and correlation are statistical concepts; in order to make inferences about causal relations one needs additional information.
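A short numerical check of this symmetry (the sample data are invented): regressing Y on X and X on Y yields two different slopes, each given by Eq. (7.5) with the roles swapped, and their product is \(r^2\), so nothing in the regression itself singles out a cause:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)   # invented linear relation

r = np.corrcoef(x, y)[0, 1]
b_yx = r * y.std() / x.std()          # slope of Y regressed on X (Eq. 7.5)
b_xy = r * x.std() / y.std()          # slope of X regressed on Y

print(b_yx, b_xy)                     # two different 'strengths'
print(b_yx * b_xy, r**2)              # their product equals r squared
```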

5 Heuristic: Hill’s Criteria

Our theories about complex phenomena are mostly incomplete, and experiments are often not possible. So one has to rely on uncertain indicators when trying to identify causes. Sir Austin Bradford Hill (1897–1991), a pioneer of modern epidemiology, proposed a set of nine criteria for providing epidemiological evidence of a causal relationship between a presumed cause and an observed effect, i.e., a disease (Hill, 1965). In particular, he demonstrated the connection between cigarette smoking and lung cancer. (And when he was convinced that smoking was a cause of lung cancer, he stopped smoking!)

His list of criteria is as follows:

  1. Strength (effect size): A small association does not mean that there is not a causal effect, though the larger the association, the more likely that it is causal.

  2. Consistency (reproducibility): Consistent findings observed by different persons in different places with different samples strengthen the likelihood of an effect.

  3. Specificity: Causation is likely if there is a very specific population at a specific site and disease with no other likely explanation. The more specific an association between a factor and an effect is, the bigger the probability of a causal relationship.

  4. Temporality: The effect has to occur after the cause (and if there is an expected delay between the cause and expected effect, then the effect must occur after that delay).

  5. Biological gradient (dose-response relationship): Greater exposure should generally lead to greater incidence of the effect. However, in some cases, the mere presence of the factor can trigger the effect. In other cases, an inverse proportion is observed: greater exposure leads to lower incidence.

  6. Plausibility: A plausible mechanism between cause and effect is helpful.

  7. Coherence: Coherence between epidemiological and laboratory findings increases the likelihood of an effect.

  8. Experiment: “Occasionally it is possible to appeal to experimental evidence”.

  9. Analogy: The use of analogies or similarities between the observed association and any other associations.

As already pointed out, it is now generally agreed that careful double-blind experiments with control groups are the gold standard for inferring a causal relation between a manipulated variable and an observed variable that co-varies with it. In fact, this criterion virtually trumps all the other factors mentioned by Hill. But in situations where experiments are impossible and no natural experiment is available, one may use the other criteria for making informed guesses. Certainty cannot be expected, but an informed guess is better than nothing; see, e.g., (Schünemann et al., 2010).

6 Summary

Scientists often report that they have observed an association between two variables. The word ‘association’ means the same as ‘correlation’, so they claim to have observed a statistical correlation. But in fact they have observed a correlation in a sample, for seldom, if ever, can one observe all items in an entire population. Saying that there is an association between two variables is in fact an inference from the sample to the entire population, and this inference is always somewhat uncertain.

If in fact there is a correlation in the entire population and not only in the sample, one may ask how this correlation came about. What is the mechanism? There is general (albeit perhaps not completely universal) agreement that there are three possible types of causal mechanisms that can result in a correlation between the variables X and Y in an entire population:

  (i) X is one (direct or indirect) cause of Y;

  (ii) Y is one (direct or indirect) cause of X;

  (iii) X and Y have a common cause.

Discussion Questions

  1. Could there be any other explanations for a correlation than those mentioned in Reichenbach’s principle?

  2. Can you suggest some explanation, other than that given by Henrich, for the correlation between cousin marriage and percentage of wealth in cash?

  3. Why are passive observations not sufficient for inferring causal connections in social science?