Computational social scientist beware: Simpson’s paradox in behavioral data
Observational data about human behavior are often heterogeneous, i.e., generated by subgroups within the population under study that vary in size and behavior. Heterogeneity predisposes analysis to Simpson’s paradox, whereby the trends observed in data that have been aggregated over the entire population may differ substantially from, and even reverse, those of the underlying subgroups. I illustrate Simpson’s paradox with several examples from studies of online behavior and show that aggregate trends lead to wrong conclusions about the underlying individual behavior. I then present a simple method to test whether Simpson’s paradox is affecting the results of an analysis. The presence of Simpson’s paradox in social data suggests that important behavioral differences exist within the population, and failure to take these differences into account can distort a study’s findings.
Keywords: Statistics · Simpson’s paradox · Survivorship bias · Ecological fallacy · Heterogeneity
The landscape of social science changed dramatically in the 21st century, when large volumes of social and behavioral data created the field of computational social science. While the bulk of the data are now digital traces of online behaviors, the accelerating instrumentation of our physical spaces is opening offline behaviors to analysis. The new data have vastly expanded the opportunities for discovery in the social sciences. Algorithms have mined behavioral data to validate theories of individual decision making and social interaction [5, 12] and to produce new insights into first principles of human behavior. These insights help to better explain and predict human behavior, and may eventually help policy makers devise more effective interventions to improve wellbeing by steering behaviors towards desirable outcomes, for example, by fostering behaviors that promote healthy habits, reduce substance abuse and social isolation, improve learning, etc.
Computational social scientists, however, face challenges, some of which were rarely encountered by past researchers. Although behavioral data are usually massive, they are also often extremely sparse and noisy at the individual level. To uncover hidden patterns, scientists might choose to aggregate data over the entire population. For example, diurnal cycles of mood (cf. Fig. 1 in ) and online activity (cf. Fig. 2 in ) only become apparent once the activity of tens of thousands or even millions of people is aggregated. In the past, when behavioral data came from populations that were carefully crafted to address specific research questions, aggregation helped improve the signal-to-noise ratio and uncover weak effects. Today, however, the same strategy can lead researchers to wrong conclusions. The reason is that current behavioral data are highly heterogeneous: they are collected from subgroups that vary widely in size and behavior. Heterogeneity is evident in practically all social data sets and can be easily recognized by its hallmark, the long-tailed distribution. The prevalence of some trait in these systems, whether the number of followers in an online social network or the number of words used in an email, can vary by many orders of magnitude, making it difficult to compare users with small values of the trait to those with large values. As shown in this paper, heterogeneity can dramatically distort the conclusions of analysis.
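A quick simulation illustrates why long-tailed traits resist naive averaging. The sketch below draws a trait from a Pareto distribution; the exponent and scale are arbitrary choices for illustration, not fitted to any real data set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a heavy-tailed trait, e.g., follower counts, with an
# illustrative Pareto exponent (1.2) and a minimum value of 10.
trait = (rng.pareto(1.2, size=100_000) + 1) * 10

# The mean is pulled far above the typical (median) user by a handful
# of extreme values, which span orders of magnitude.
print("median:", np.median(trait))
print("mean:  ", trait.mean())
print("max:   ", trait.max())
```

With such a distribution, "the average user" describes almost no actual user, which is one reason aggregate statistics can mislead.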
Simpson’s paradox [4, 18] is one important phenomenon confounding analysis of heterogeneous social data. According to the paradox, an association observed in data that have been aggregated over an entire population may be quite different from, and even opposite to, associations found in the underlying subgroups. A notorious example of Simpson’s paradox comes from the gender bias lawsuit against UC Berkeley. Analysis of graduate school admissions data seemingly revealed a statistically significant bias against women: a smaller fraction of female applicants was admitted for graduate studies. However, when admissions data were disaggregated by department, women had parity with, and even a slight advantage over, men in some departments. The paradox arose because the departments preferred by female applicants had lower admissions rates for both genders.
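The arithmetic behind the admissions example can be reproduced in a few lines. The counts below are invented for illustration (not the actual Berkeley figures), but they show how per-department and pooled rates can disagree:

```python
# Hypothetical admissions counts illustrating Simpson's paradox.
# dept: (men admitted, men applied, women admitted, women applied)
admissions = {
    "A": (80, 100, 18, 20),   # lenient department, mostly male applicants
    "B": (1, 10, 18, 100),    # selective department, mostly female applicants
}

# Within each department, women are admitted at an equal or higher rate...
for dept, (am, nm, aw, nw) in admissions.items():
    print(dept, "men:", am / nm, "women:", aw / nw)

# ...yet pooled over departments, men appear favored, because women
# disproportionately applied to the selective department.
men_rate = sum(v[0] for v in admissions.values()) / sum(v[1] for v in admissions.values())
women_rate = sum(v[2] for v in admissions.values()) / sum(v[3] for v in admissions.values())
print("aggregate men:", men_rate, "aggregate women:", women_rate)
```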
Simpson’s paradox also affects the analysis of trends. When measuring how an outcome variable changes as a function of some independent variable, the characteristics of the population over which the trend is measured may change with the independent variable. As a result, the data may appear to exhibit a trend that disappears or reverses when the data are disaggregated by subgroups. Vaupel and Yashin give several illustrations of this effect. For example, a study of recidivism among convicts released from prison showed that the rate at which they return to prison declines over time. From this, policy makers concluded that age has a pacifying effect, with older convicts less likely to commit crimes. In reality, this is not the case. Instead, the population of ex-convicts is composed of two subgroups with nearly constant, but very different, recidivism rates. The first subgroup—the “reformed”—will never commit a crime once released from prison. The other subgroup—the “incorrigibles”—is highly likely to commit a crime. Over time, as “incorrigibles” commit offenses and return to prison, there are fewer of them left in the population. Survivor bias changes the composition of the population, creating the illusion of an overall decline in recidivism. As Vaupel and Yashin warn, “unsuspecting researchers who are not wary of heterogeneity’s ruses may fallaciously assume that observed patterns for the population as a whole also hold on the sub-population or individual level.”
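The survivor-bias mechanism is easy to simulate. In the sketch below, the subgroup sizes and re-offense rate are assumptions chosen for illustration; the point is only that two subgroups with constant rates yield an aggregate rate that declines month after month:

```python
# Deterministic sketch of the recidivism example: track population
# fractions rather than individuals.
reformed = 0.5        # fraction who never re-offend (rate 0)
incorrigible = 0.5    # fraction with a constant monthly re-offense rate
p = 0.2               # that constant rate (an illustrative assumption)

rates = []
for month in range(6):
    free = reformed + incorrigible           # still out of prison
    rates.append(incorrigible * p / free)    # aggregate rate this month
    incorrigible *= 1 - p                    # re-offenders return to prison

# The aggregate rate declines even though both subgroups' rates never change.
print([round(r, 3) for r in rates])
```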
To highlight the perils of ignoring Simpson’s paradox, I describe several studies of online behavior in which the trends discovered in aggregate data lead to wrong conclusions about behavior. For decision makers and platform designers seeking to use research findings to inform policy, incorrect interpretation can lead to counterproductive choices, where a policy thought to enhance some behavior instead suppresses it, or vice versa. To identify such cases, I present a simple method researchers can use to test for the presence of the paradox in their data. When the paradox is confirmed, analysis should be performed on stratified data that have been disaggregated by subgroups [1, 18]. Testing and controlling for Simpson’s paradox should be part of every computational social scientist’s toolbox.
Examples of Simpson’s paradox
Content consumption in social media A study of content consumption on Facebook, a popular social networking site, examined the time users devote to viewing each item in their social feed. The study segmented each user’s activity into sessions, defined as sequences of activity without a prolonged break (see Fig. 4 for an explanation). At the population level, it looks as if users slow down over the course of a session, taking more and more time to view each item (Fig. 2a). However, when looking at user activity within sessions of the same length, e.g., sessions that are 30 min long, it appears that individuals speed up instead (Fig. 2b). As the session progresses, they spend less and less time viewing each item, which suggests that they begin to skim posts.
Answer quality on Stack Exchange Stack Exchange is a popular question-answering platform where users ask and answer questions. Askers can also “accept” an answer as the best answer to their question. A study of the dynamics of user performance on Stack Exchange found that answer quality, as measured by the probability that the asker accepts it as the best answer, declines steadily over the course of a session, with each successive answer written by a user ever less likely to be accepted. However, this trend is seen only when comparing sessions of the same length, for example, sessions in which exactly four answers were written (Fig. 3b). When calculating answer acceptance probability over all the data, it looks as though answers written later in a session are more likely to be accepted (Fig. 3a). Here, the length of the session confounds the analysis: users who have longer sessions write answers that are more likely to be accepted.
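In both examples the remedy is the same: condition on session length before measuring the trend. A minimal sketch with pandas, using invented records (the column names and values are assumptions for illustration, not Stack Exchange data):

```python
import pandas as pd

# Toy answer-level records: many short sessions with low acceptance,
# a few long sessions with high acceptance.
df = pd.DataFrame({
    "session_len": [1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4],
    "position":    [1, 1, 1, 1, 1, 2, 3, 4, 1, 2, 3, 4],
    "accepted":    [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0],
})

# Pooled over all sessions, acceptance rises from position 1 to 2,
# because only the high-quality long sessions reach position 2...
aggregate = df.groupby("position")["accepted"].mean()

# ...but within sessions of length 4, acceptance never rises.
by_length = df.groupby(["session_len", "position"])["accepted"].mean()
print(aggregate, by_length, sep="\n")
```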
Testing data for Simpson’s paradox
When can a cautious researcher accept the results of an analysis? I describe a simple test that can help ascertain whether a pattern observed in data is robust or potentially a manifestation of Simpson’s paradox. The test creates a randomized version of the data by shuffling the data with respect to the attribute for which the trend is measured. Shuffling preserves the distribution of features but destroys any correlation between the outcome variable and that attribute. As a result, any trends with respect to that attribute should disappear. This suggests a rule of thumb: if the trend persists in the aggregate data, but disappears when the shuffled data are disaggregated, then Simpson’s paradox may be present.
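A minimal version of such a shuffle test might look as follows. Here `shuffle_test` is a hypothetical helper, and correlation stands in for whatever trend statistic a study actually uses:

```python
import numpy as np

rng = np.random.default_rng(42)

def shuffle_test(attribute, outcome, n_shuffles=100):
    """Compare the observed attribute-outcome correlation to the
    correlations in shuffled copies, where the outcome has been
    permuted to break any association with the attribute."""
    observed = np.corrcoef(attribute, outcome)[0, 1]
    shuffled = [np.corrcoef(attribute, rng.permutation(outcome))[0, 1]
                for _ in range(n_shuffles)]
    return observed, float(np.mean(shuffled)), float(np.std(shuffled))

# Example with synthetic data: a genuine linear trend plus noise.
x = np.arange(500, dtype=float)
y = 0.5 * x + rng.normal(scale=20, size=500)
obs, shuf_mean, shuf_std = shuffle_test(x, y)
print(obs, shuf_mean, shuf_std)  # observed corr is large; shuffled corrs hover near 0
```

A trend much larger than the spread of the shuffled values is unlikely to be an artifact of shuffling-invariant structure; a trend that survives shuffling, as in the shopping example below, signals heterogeneity.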
In the analyses described above, the independent variable was time, or a proxy of it, such as the point within a session when the action takes place. There are at least two different randomization strategies with respect to time. The first strategy creates randomized session data by preserving the temporal order of actions but shuffling the time intervals between them, as shown in Fig. 4 (middle row). Since a session break is defined as a sufficiently long time interval between actions, shuffling the time intervals will merge some sessions and break up longer ones, while preserving the sequence of actions. The second strategy creates randomized index data by shuffling the order of actions within a session, e.g., exchanging \(C_1\) with \(C_3\) in Fig. 4 (bottom row).
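Both strategies can be sketched in a few lines; the toy timestamps, action labels, and function names below are assumptions for illustration:

```python
import random

random.seed(7)

def shuffle_intervals(times):
    """Strategy 1: keep the order of actions but permute the time gaps
    between them. Re-segmenting the result merges some sessions and
    splits others."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    random.shuffle(gaps)
    shuffled = [times[0]]
    for g in gaps:
        shuffled.append(shuffled[-1] + g)
    return shuffled

def shuffle_order(session_actions):
    """Strategy 2: permute the order of actions within each session,
    e.g., swapping C1 with C3, leaving session boundaries untouched."""
    return [random.sample(s, len(s)) for s in session_actions]

# Toy timestamps (seconds): three quick actions, a long break, two more.
times = [0, 60, 120, 7200, 7260]
print(shuffle_intervals(times))
print(shuffle_order([["C1", "C2", "C3"], ["C4", "C5"]]))
```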
Online shopping A study of online shopping examined whether individual purchasing decisions are constrained by finances. The study looked at the relationship between the purchase price of an item and the time interval since the last purchase. Budgetary constraints would force a user to wait after making a purchase to accumulate enough money for another one. Figure 5a reports the (normalized) purchase price of an item as a function of the time since the last purchase (red line). The longer the delay, the larger the fraction of the budget users spend on a purchase, which appears to support the hypothesis.
To test the robustness of this finding, the data were shuffled by randomly swapping the prices of products purchased by users, which destroys any correlation between the time between purchases and the purchase price. Surprisingly, the trend remains (blue line). This is due to the heterogeneity of the underlying population, which is a mix of users with different purchasing habits. Frequent buyers make cheaper purchases more often, and they are systematically overrepresented on the left side of the plot, even in the shuffled data.
Stack Exchange To test the robustness of the trends shown in Fig. 3, which reports how the acceptance probability of an answer posted on Stack Exchange changes over the course of a session, we randomize the data by shuffling the time intervals between answers posted by each user, while preserving other features, including the temporal order of answers. This randomization procedure changes sessions by breaking up longer sessions and concatenating shorter ones. By changing which sequence of answers is considered to belong to a session, we expect the randomization to change the observed trends in acceptance probability.
Reddit comments A similar quality-deterioration effect was observed for comments posted on Reddit. Regardless of which measure is used as a proxy for quality—comment length, the number of responses or upvotes it receives from others, its textual complexity—the quality of each successive comment written by a Reddit user decreases over the course of a session. To test the robustness of this finding, Singer et al. randomized the Reddit activity data. Figure 7 compares the trends for the proxies of comment quality in the original data to those in the randomized data. Both data sets have been disaggregated by session length. The decreasing trends observed in the original Reddit data (top row) largely disappear in the randomized data (bottom row). Where the trends still exist, the deterioration effect is much reduced. This suggests that most of the data heterogeneity is captured by session length.
Simpson’s paradox can indicate that interesting patterns exist in data, but it can also skew analysis. The paradox suggests that the data come from subgroups that differ systematically in their behavior, and that these differences are large enough to affect the analysis of aggregate data. In this case, the trends discovered in disaggregated data are more likely to describe—and predict—individual behavior than the trends found in aggregate data. Thus, to build more robust models of behavior, computational social scientists need to identify the confounding variables that could affect observed trends. The shuffle test described in this paper provides a framework for determining whether Simpson’s paradox is affecting conclusions.
Many people have contributed along the way to identifying the problem of Simpson’s paradox in data analysis, investigating it empirically, and devising methods to mitigate its effects. These people include Nathan Hodas, Farshad Kooti, Keith Burghardt, Philipp Singer, Emilio Ferrara, Peter Fennell, and Nazanin Alipourfard. This work was funded, in part, by the Army Research Office under contract W911NF-15-1-0142.
- 1. Alipourfard, N., Fennell, P., & Lerman, K. (2018). Can you trust the trend? Discovering Simpson’s paradoxes in social data. In Proceedings of the 11th International ACM Conference on Web Search and Data Mining. ACM.
- 2. Barbosa, S., Cosley, D., Sharma, A., & Cesar, R. M., Jr. (2016). Averaging gone wrong: Using time-aware analyses to better understand behavior. In Proceedings of the World Wide Web Conference (pp. 829–841).
- 6. Fabris, C., & Freitas, A. (2000). Discovering surprising patterns by detecting occurrences of Simpson’s paradox. In M. Bramer, A. Macintosh, & F. Coenen (Eds.), Research and Development in Intelligent Systems XVI (pp. 148–160). London: Springer.
- 7. Ferrara, E., Alipourfard, N., Burghardt, K., Gopal, C., & Lerman, K. (2017). Dynamics of content quality in collaborative knowledge production. In Proceedings of the 11th AAAI International Conference on Web and Social Media. AAAI.
- 9. Hodas, N. O., & Lerman, K. (2012). How limited visibility and divided attention constrain social contagion. In ASE/IEEE International Conference on Social Computing.
- 10. Hodas, N. O., & Lerman, K. (2014). The simple rules of social contagion. Scientific Reports, 4, 4343.
- 12. Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., & Mullainathan, S. (2017). Human decisions and machine predictions. Technical report, National Bureau of Economic Research.
- 13. Kooti, F., Lerman, K., Aiello, L. M., Grbovic, M., Djuric, N., & Radosavljevic, V. (2016). Portrait of an online shopper: Understanding and predicting consumer behavior. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining.
- 14. Kooti, F., Subbian, K., Mason, W., Adamic, L., & Lerman, K. (2017). Understanding short-term changes in online activity sessions. In Proceedings of the 26th International World Wide Web Conference (Companion WWW2017).
- 18. Norton, J. H., & Divine, G. (2015). Simpson’s paradox... and how to avoid it. Significance, 12(4), 40–43.
- 19. Rodriguez, M. G., Gummadi, K., & Schoelkopf, B. (2014). Quantifying information overload in social media and its impact on social contagions. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media.
- 20. Romero, D. M., Meeder, B., & Kleinberg, J. (2011). Differences in the mechanics of information diffusion across topics: Idioms, political hashtags, and complex contagion on Twitter. In Proceedings of the 20th International Conference on World Wide Web (pp. 695–704). New York, NY: ACM.
- 22. Ver Steeg, G., Ghosh, R., & Lerman, K. (2011). What stops social epidemics? In Proceedings of the 5th International Conference on Weblogs and Social Media. AAAI.
- 23. Vaupel, J. W., & Yashin, A. I. (1985). Heterogeneity’s ruses: Some surprising effects of selection on population dynamics. The American Statistician, 39(3), 176–185.