Association Versus Causation
Scientific knowledge provides a general understanding of how phenomena in the world are connected to one another. It offers a means of categorizing things (typology), predicting future events, explaining past events, and understanding the causes of phenomena (causation). Association, also called correlation or covariation, is an empirical and statistical relationship between two variables such that changes in one variable are connected to changes in the other. However, association in and of itself does not necessarily imply a causal relationship between the two variables. It is only one of three necessary criteria for establishing causation; the other two are time order and non-spuriousness. While the advance of big data makes it possible to capture a far greater number of correlations and predictions than ever before, and statistical analyses can assess the degree of association between variables using continuous data drawn from big datasets, one must still consider the theoretical underpinning of the study and how the data were collected (i.e., whether measurement of the independent variable precedes measurement of the dependent variable) in order to determine whether the causal relationship is valid.
The purpose of this entry is to focus on association and on one function of scientific knowledge, causation: what they are, how they relate to and differ from each other, and what role big data plays in this process.
A scientific theory specifies relationships between concepts or variables in ways that describe, predict, and explain how the world operates. One type of relationship between variables is association, or covariation. In this relationship, changes in the values of one variable are related to changes in the values of the other variable. In other words, the two variables shift their values together. Statistical procedures are needed to establish association. To determine whether variable A is associated with variable B, we must see how the values of variable B shift when two or more values of variable A occur. If the values of variable B shift systematically with each level of variable A, then we can say there is an association between variables A and B. For example, to determine whether aggressiveness is really associated with exposure to violent television programs, we must observe aggressiveness under at least two levels of exposure to violent television programs, such as high exposure and low exposure. If a higher level of aggressiveness is found under the condition of higher exposure to violent television programs than under the condition of lower exposure, we can conclude that there is a positive association between exposure to television violence and aggressiveness. If a lower level of aggressiveness is observed under the condition of higher exposure than under the condition of lower exposure, we can conclude that there is a negative, or inverse, association between the two variables. Both situations indicate that exposure to television violence and aggressiveness are associated, or covary.
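The comparison described above can be sketched in a few lines of Python. This is a minimal illustration only: the scores are invented for the example (they do not come from any study), and the function name `direction_of_association` is hypothetical.

```python
from statistics import mean

# Hypothetical illustrative scores (not real data): aggressiveness ratings
# observed under two levels of exposure to violent television programs.
aggressiveness = {
    "low": [2, 3, 3, 4, 2],
    "high": [5, 6, 4, 7, 6],
}

def direction_of_association(low_scores, high_scores):
    """Compare mean outcomes across two levels of the other variable."""
    diff = mean(high_scores) - mean(low_scores)
    if diff > 0:
        return "positive"   # outcome is higher under the higher level
    if diff < 0:
        return "negative"   # outcome is lower under the higher level
    return "none"           # outcome does not shift between levels

result = direction_of_association(aggressiveness["low"], aggressiveness["high"])
print(result)
```

A difference in sample means only describes the direction of the shift. Judging whether the shift is systematic rather than due to chance would additionally require a significance test, which this sketch omits.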
To claim that variable A is a cause of variable B, the two variables must be associated with one another. If high and low levels of viewing violent television programs are accompanied by the same level of aggressiveness, then there is no association between watching television violence and aggressiveness. In other words, knowing how much violent television a person watches does not help in any way to predict that person's level of aggressiveness. In this case, watching television violence cannot be a cause of aggressiveness. On the other hand, a simple association between these two variables does not imply causation. Other criteria are needed to establish causation.
A dominant theoretical framework in media communication research is agenda-setting theory. McCombs and colleagues' research suggests that there is an association between prominent media coverage and what people tend to think about. That is, media emphasis on certain issues tends to be associated with the perceived importance of those issues among the public. Recent research has examined the agenda-setting effect in the context of big data, for example, by assessing the relationship between digital content produced by traditional media outlets (e.g., print, television) and user-generated content (i.e., blogs, forums, and social media). While agenda-setting research typically identifies associations between the prominence of media coverage of some issues and the importance the public attaches to those issues, research designs must account for the sequence (i.e., time order) in which variables occur. For example, while it is plausible to think that media coverage influences what the public thinks about, in the age of new media the public also plays an increasingly important role in influencing what news media outlets cover. Such explorations are questions of causality and require consideration of the time order between variables. Additionally, potential external causes of variation must be considered in order to truly establish causation.
A second criterion for establishing causality is that a cause (independent variable) should take place before its effect (dependent variable). This means that changes in the independent variable should influence changes in the dependent variable, but not vice versa. This is also called the direction of influence (from independent variable to dependent variable). For some relationships in social research, the time order or direction of influence is clear. For instance, parents' education always occurs before their children's education. For others, the time order is not easy to determine. For example, while it is easy to find that viewing television violence and aggressiveness are related, it is much harder to determine which variable causes the changes in the other. One plausible explanation is that the more one views television violence, the more one imitates the violent behavior seen on television and becomes more aggressive (per social learning theory). An equally plausible interpretation is that an aggressive person is usually attracted to violent television programs. Without convincing evidence about the time order or direction of influence, there is no sound basis for determining which is the cause (independent variable) and which is the effect (dependent variable).
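One way to bring time order into the analysis is to measure both variables at two points in time and compare the two lagged correlations (a cross-lagged comparison). The sketch below is an illustration under stated assumptions, not a method prescribed by this entry: the data are simulated, the variable names are hypothetical, and the simulation deliberately builds in the assumption that earlier viewing affects later aggressiveness but not the reverse.

```python
import random
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(0)  # fixed seed so the simulation is repeatable
n = 500

# Wave 1: both variables are measured (simulated, standardized scores).
viewing_t1 = [rng.gauss(0, 1) for _ in range(n)]
aggression_t1 = [rng.gauss(0, 1) for _ in range(n)]

# Wave 2: by construction (an assumption of this simulation), later
# aggressiveness depends on earlier viewing, while later viewing depends
# only on earlier viewing and not on earlier aggressiveness.
aggression_t2 = [0.8 * v + rng.gauss(0, 0.5) for v in viewing_t1]
viewing_t2 = [v + rng.gauss(0, 0.5) for v in viewing_t1]

# Compare the two lagged correlations.
r_view_then_aggr = pearson_r(viewing_t1, aggression_t2)
r_aggr_then_view = pearson_r(aggression_t1, viewing_t2)

print(f"viewing(t1) -> aggressiveness(t2): {r_view_then_aggr:.2f}")
print(f"aggressiveness(t1) -> viewing(t2): {r_aggr_then_view:.2f}")
```

In this simulated setting the first lagged correlation is strong and the second is close to zero, which is consistent with viewing preceding aggressiveness. With real data such a pattern is only suggestive, because an unmeasured third variable could still produce it.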
With some research designs, such as controlled experiments, the time order of influence is easier to determine. Recent research examining people's use of mobile technology employed a field experiment to understand people's political web-browsing behavior. For example, Hoffman and Fang tracked individuals' web-browsing behavior over 4 months to determine predictors (e.g., political ideology) of the amount of time individuals spend browsing certain political content over others. Such research is able to establish that some preexisting characteristic predicts, or some manipulation causes, a change in web-browsing behavior.
The third essential criterion for establishing a causal relationship is non-spuriousness: the relationship between two variables must not be caused by variation in a third, extraneous variable. A seeming association between two variables might be produced by a common third variable (a spurious relationship) rather than by an influence of the presumed independent variable on the dependent variable. One well-known example is the association between a person's foot size and verbal ability in the 2010 US Census. If you believe that association or correlation implies causation, then you might think that larger feet cause better verbal ability. But the apparent relationship between foot size and verbal ability is a spurious one, because both are linked to a common third variable: age. As one grows older, one's feet become larger and one becomes better at communicating, but there is no logical and inherent relationship between foot size and verbal ability. To return to the agenda-setting example, perhaps a third variable influences the relationship between media issue coverage and the importance the public attaches to issues. For example, the nature of issue coverage (e.g., emotional coverage, or coverage of issues of personal importance) might influence what the public thinks about issues presented by the media. Therefore, when we infer a causal relationship from an observed association, we need to rule out the influence of a third variable (or rival hypothesis) that might have created a spurious relationship between the variables.
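The foot-size example can be illustrated with a small simulation. The sketch below is a hedged illustration rather than an analysis of any real dataset: all numbers are invented, the variable names are hypothetical, and both foot size and verbal ability are generated from age alone. It then compares the raw correlation with the first-order partial correlation that statistically controls for age.

```python
import random
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def partial_r(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(0)  # fixed seed so the simulation is repeatable
n = 1000

# Simulated (hypothetical) data: age drives both variables; neither
# variable has any direct effect on the other.
age = [rng.uniform(2, 18) for _ in range(n)]
foot_size = [10 + 1.0 * a + rng.gauss(0, 1.5) for a in age]
verbal_ability = [5.0 * a + rng.gauss(0, 10) for a in age]

raw_r = pearson_r(foot_size, verbal_ability)
controlled_r = partial_r(foot_size, verbal_ability, age)

print(f"raw correlation:            {raw_r:.2f}")
print(f"partial correlation (age):  {controlled_r:.2f}")
```

In this simulation the raw correlation is large, while the partial correlation is close to zero once age is held constant, which is the signature of a spurious relationship. Controlling for a variable in this way only helps when the relevant third variable has actually been measured.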
In conclusion, despite the accumulation of an enormous number of associations or correlations in the era of big data, association still does not supersede causation. To establish causation, the criteria of time order and non-spuriousness must also be met, on a sound theoretical foundation, in the broader context of big data.
- Babbie, E. (2007). The practice of social research (11th ed.). Belmont: Wadsworth.
- McCombs, M. (2004). Setting the agenda: The mass media and public opinion. Cambridge, UK: Polity.
- Reynolds, P. (2007). A primer in theory construction. Boston: Pearson/Allyn & Bacon.
- Singleton, R., & Straits, B. (2010). Approaches to social research (5th ed.). New York: Oxford University Press.