Journal of Computational Social Science

Volume 1, Issue 1, pp 49–58

Computational social scientist beware: Simpson’s paradox in behavioral data

  • Kristina Lerman
Survey Article


Observational data about human behavior are often heterogeneous, i.e., generated by subgroups within the population under study that vary in size and behavior. Heterogeneity predisposes analysis to Simpson’s paradox, whereby the trends observed in data that have been aggregated over the entire population may be substantially different from those of the underlying subgroups. I illustrate Simpson’s paradox with several examples coming from studies of online behavior and show that aggregate response leads to wrong conclusions about the underlying individual behavior. I then present a simple method to test whether Simpson’s paradox is affecting results of analysis. The presence of Simpson’s paradox in social data suggests that important behavioral differences exist within the population, and failure to take these differences into account can distort the studies’ findings.


Keywords: Statistics, Simpson’s paradox, Survivorship bias, Ecological fallacy, Heterogeneity


The landscape of social science changed dramatically in the 21st century, when large volumes of social and behavioral data created the field of computational social science [15]. While the bulk of the data are now digital traces of online behaviors, the accelerating instrumentation of our physical spaces is opening offline behaviors to analysis. The new data have vastly expanded the opportunities for discovery in the social sciences [17]. Algorithms have mined behavioral data to validate theories of individual decision making and social interaction [5, 12] and produce new insights into first principles of human behavior. These insights help to better explain and predict human behavior, and eventually even help policy makers devise more effective interventions to improve wellbeing by steering behaviors towards desirable outcomes, for example, by fostering behaviors that promote healthy habits, reduce substance abuse and social isolation, improve learning, etc.

Computational social scientists, however, are facing challenges, some of which were rarely encountered by past researchers. Although behavioral data are usually massive, they are also often extremely sparse and noisy at the individual level. To uncover hidden patterns, scientists might choose to aggregate data over the entire population. For example, diurnal cycles of mood (cf. Fig. 1 in [8]) and online activity (cf. Fig. 2 in [11]) only become apparent once the activity of tens of thousands or even millions of people is aggregated. In the past, when behavioral data came from populations that were carefully crafted to address specific research questions [17], aggregation helped improve signal-to-noise ratio and uncover weak effects. Today, however, the same strategy can lead researchers to wrong conclusions. The reason for this is that current behavioral data are highly heterogeneous: they are collected from subgroups that vary widely in size and behavior. Heterogeneity is evident in practically all social data sets and can be easily recognized by its hallmark, the long-tailed distribution. The prevalence of some trait in these systems, whether the number of followers in an online social network, or the number of words used in an email, can vary by many orders of magnitude, making it difficult to compare users with small values of the trait to those with large values. As shown in this paper, heterogeneity can dramatically distort conclusions of analysis.

Simpson’s paradox [4, 18] is one important phenomenon confounding analysis of heterogeneous social data. According to the paradox, an association observed in data that have been aggregated over an entire population may be quite different from, and even opposite to, associations found in the underlying subgroups. A notorious example of Simpson’s paradox comes from the gender bias lawsuit against UC Berkeley [3]. Analysis of graduate school admissions data seemingly revealed a statistically significant bias against women: a smaller fraction of female applicants were admitted for graduate studies. However, when admissions data were disaggregated by department, women had parity and even a slight advantage over men in some departments. The paradox arose because departments preferred by female applicants have lower admissions rates for both genders.
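The reversal can be reproduced with a few lines of arithmetic. The sketch below uses made-up applicant counts (they are not the Berkeley figures reported in [3]); the department labels and all numbers are assumptions chosen only to make the effect visible.

```python
# Hypothetical counts, NOT the Berkeley data: group -> (applied, admitted)
admissions = {
    "dept_easy": {"men": (800, 480), "women": (200, 130)},
    "dept_hard": {"men": (200, 40), "women": (800, 180)},
}


def aggregate_rate(data, group):
    """Admission rate for one group after pooling all departments."""
    applied = sum(dept[group][0] for dept in data.values())
    admitted = sum(dept[group][1] for dept in data.values())
    return admitted / applied


# Admission rate of each group inside each department
per_dept = {
    name: {group: admitted / applied for group, (applied, admitted) in dept.items()}
    for name, dept in admissions.items()
}
# Admission rate of each group over the whole (pooled) population
overall = {group: aggregate_rate(admissions, group) for group in ("men", "women")}

for name, rates in per_dept.items():
    print(name, rates)
print("overall", overall)
```

With these counts, women are admitted at a higher rate than men in each department (0.65 vs. 0.60 and 0.225 vs. 0.20), yet at a lower rate overall (0.31 vs. 0.52), because most of the hypothetical female applicants chose the department with the lower admission rate.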

Simpson’s paradox also affects analysis of trends. When measuring how an outcome variable changes as a function of some independent variable, the characteristics of the population over which the trend is measured may change with the independent variable. As a result, the data may appear to exhibit a trend, which disappears or reverses when the data are disaggregated by subgroups [1]. Vaupel and Yashin [23] give several illustrations of this effect. For example, a study of recidivism among convicts released from prison showed that the rate at which they return to prison declines over time. From this, policy makers concluded that age has a pacifying effect, with older convicts less likely to commit crimes. In reality, this is not the case. Instead, the population of ex-convicts is composed of two subgroups with nearly constant, but very different recidivism rates. The first subgroup, the “reformed,” will never commit a crime once released from prison. The other subgroup, the “incorrigibles,” is highly likely to commit a crime. Over time, as “incorrigibles” commit offenses and return to prison, there are fewer of them left in the population. Survivor bias changes the composition of the population, creating an illusion of an overall decline in recidivism. As Vaupel and Yashin warn, “unsuspecting researchers who are not wary of heterogeneity’s ruses may fallaciously assume that observed patterns for the population as a whole also hold on the sub-population or individual level.”
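The recidivism argument can be checked with a short expected-value calculation. This is a minimal sketch under stated assumptions: the subgroup sizes, the per-period reoffense probability of 0.3 for the “incorrigibles,” and the function name are all hypothetical, and the “reformed” are assumed never to reoffend, as in the description above.

```python
def aggregate_hazard(n_reformed, n_incorrigible, p_incorrigible, periods):
    """Expected share of people still free who return to prison in each period.

    Both subgroup rates are constant over time: 0 for the reformed and
    p_incorrigible for the incorrigibles. Only the mix of people changes.
    """
    hazards = []
    incorrigible_left = float(n_incorrigible)
    for _ in range(periods):
        at_risk = n_reformed + incorrigible_left
        returning = incorrigible_left * p_incorrigible
        hazards.append(returning / at_risk)
        incorrigible_left -= returning  # those who return leave the at-risk pool
    return hazards


# Hypothetical population: 500 reformed, 500 incorrigible
hazards = aggregate_hazard(500, 500, 0.3, 6)
print([round(h, 3) for h in hazards])
```

Although neither subgroup changes its behavior, the population-level rate falls from 0.15 in the first period to roughly 0.12 in the second and keeps declining, purely because the composition of the at-risk population shifts toward the “reformed.”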

To highlight the perils of ignoring Simpson’s paradox, I describe several studies of online behavior in which the trends discovered in aggregate data lead to wrong conclusions about behavior. For decision makers and platform designers seeking to use research findings to inform policy, incorrect interpretation can lead to counterproductive choices where a policy thought to enhance some behavior instead suppresses it, or vice versa. To identify such cases, I present a simple method researchers can use to test for the presence of the paradox in their data. When the paradox is confirmed, analysis should be performed on stratified data, that is, data that have been disaggregated by subgroups [1, 18]. Testing and controlling for Simpson’s paradox should be part of every computational social scientist’s toolbox.

Examples of Simpson’s paradox

Multiple examples of Simpson’s paradox have been identified in empirical studies of online behavior. For example, a study of Reddit [2] found that average comment length decreased over time. However, when data were disaggregated by cohorts based on the year the user joined Reddit, comment length within each cohort increased. Additional examples of Simpson’s paradox are described below.
Fig. 1

Exposure response in social media. The probability to retweet some information as a function of the number of friends who previously tweeted it has a nonmonotonic trend when averaged over all users (a), but increases monotonically when users are separated according to the number of friends they follow (b). This suggests that additional exposures increase retweet likelihood, instead of suppressing it

Exposure response in social media. When examining how users spread information on the social media site Twitter, it may appear that repeated exposures to hashtags or links to online content make an individual less likely to use the hashtag himself or herself (Fig. 1 of [20]) or share the links with followers [22] (Fig. 1a). From this, one may conclude that additional exposures “inoculate” the user and suppress the sharing of information. In fact, the opposite is true: additional exposures monotonically increase the user’s likelihood to share information with followers [16]. The paradox arises because those users who follow many others (and are therefore likely to be exposed to information or a hashtag multiple times) are less responsive overall (Fig. 1b), simply because they are overloaded with the large volume of information they receive [9]. Calculating response as a function of the number of exposures in the aggregate data falls prey to survivor bias: the more responsive users (with fewer friends) quickly drop out of analysis (since they are generally exposed fewer times), leaving only the highly connected, but less responsive users behind. Their reduced susceptibility biases aggregate response, leading to wrong conclusions about individual behavior. Once data are disaggregated based on the volume of information individuals receive [19], a clearer pattern of response emerges, one that is more predictive of behavior [10].
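The survivor-bias mechanism behind Fig. 1 can be illustrated with a toy calculation. The sketch below is not derived from the Twitter data: the two user groups, their sizes, the maximum number of exposures at which each group is observed, and the response curves are all hypothetical values picked so that each group responds more with every additional exposure.

```python
# Hypothetical groups: (number of users, max exposures observed, response curve)
groups = {
    "few_friends": (1000, 3, lambda k: 0.10 + 0.02 * k),
    "many_friends": (200, 8, lambda k: 0.01 + 0.005 * k),
}


def aggregate_response(groups, k):
    """User-weighted mean response over the groups still observed at k exposures."""
    observed = [(n, p(k)) for n, k_max, p in groups.values() if k <= k_max]
    total_users = sum(n for n, _ in observed)
    return sum(n * p for n, p in observed) / total_users


# Aggregate response for 1..8 exposures
aggregate = [aggregate_response(groups, k) for k in range(1, 9)]
print([round(a, 4) for a in aggregate])
```

Within each group the response probability increases with every exposure, but the aggregate curve rises to about 0.14 at three exposures and then drops to 0.03 at four, the point where the responsive, poorly connected users leave the analysis.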
Fig. 2

Rate of content consumption during a session. Average time spent viewing each item in a social feed appears to increase over the course of a session when looking at all the data (a) but decreases within sessions of the same length (b). This indicates that users speed up near the end of the session, taking less and less time to view each item

Content consumption in social media. A study of content consumption on the popular social networking site Facebook examined the time users devote to viewing each item in their social feed [14]. The study segmented each user’s activity into sessions, defined as sequences of activity without a prolonged break (see Fig. 4 for an explanation). At a population level, it looks as if users slow down over the course of a session, taking more and more time to view each item (Fig. 2a). However, when looking at user activity within sessions of the same length, e.g., sessions that are 30 min long, it appears that individuals speed up instead (Fig. 2b). As the session progresses, they spend less and less time viewing each item, which suggests that they begin to skim posts.

The difference in trends arises because users who have longer sessions also tend to spend more time viewing each item in their feed. When calculating how long users view items as a function of time, the faster users drop out of analysis of aggregate data, leaving the slower users, who tend to have longer sessions. Therefore, stratifying data by session length removes the confounding factor and allows us to study behavior within a similar cohort.
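Stratification by session length can be sketched as follows. The data here are synthetic and purely illustrative: each session is a list of per-item view times, session length is measured in number of items rather than minutes, and the rule that generates the values (longer sessions start slower, every session speeds up by one unit per item) is an assumption.

```python
from collections import defaultdict
from statistics import mean


def mean_by_index(sessions):
    """Average value at each position, pooling every session that reaches it."""
    by_index = defaultdict(list)
    for session in sessions:
        for i, value in enumerate(session):
            by_index[i].append(value)
    return [mean(by_index[i]) for i in sorted(by_index)]


def mean_by_index_per_length(sessions):
    """Same average, computed separately for sessions of equal length."""
    by_length = defaultdict(list)
    for session in sessions:
        by_length[len(session)].append(session)
    return {length: mean_by_index(group) for length, group in by_length.items()}


# Synthetic view times: a session of `length` items starts at 10 * length
# and gets one unit faster with every item viewed.
sessions = [[10 * length - i for i in range(length)] for length in range(1, 5)]

aggregate = mean_by_index(sessions)
stratified = mean_by_index_per_length(sessions)
print("aggregate:", aggregate)
print("sessions of length 4:", stratified[4])
```

The aggregate averages increase with position (25, 29, 33, 37) even though every individual session decreases (for example 40, 39, 38, 37), because only the slower, longer sessions contribute to the later positions.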
Fig. 3

Quality of answers on Stack Exchange. Probability that an answer is accepted as the best answer to a question increases as a function of its position within the session in the aggregated data (a) but decreases within sessions of the same length (b). This suggests that the quality of answers written by users deteriorates over the course of a session. Note that each line in the right panel represents sessions of a given length. Only sessions with five or fewer answers are shown

Answer quality on Stack Exchange. Stack Exchange is a popular question-answering platform where users ask and answer questions. Askers can also “accept” an answer as the best answer to their question. A study of dynamics of user performance on Stack Exchange found that answer quality, as measured by the probability that it will be accepted by the asker as the best answer, declines steadily over the course of a session, with each successive answer written by a user ever less likely to get accepted [7]. However, this trend is seen only when comparing sessions of the same length, for example, sessions where exactly four answers were written (Fig. 3b). When calculating answer acceptance probability over all the data, it looks as though answers written later in a session are more likely to get accepted (Fig. 3a). Here, the length of the session confounds analysis: users who have longer sessions write answers that are more likely to be accepted.

Fig. 4

Data randomization for the shuffle test. The top row shows the original stream of user actions \(C_1, \ldots , C_4\). A session is a sequence of actions without an extended break, e.g., 60 min. Here, user actions \(C_1\) through \(C_3\) are assigned to one session, while \(C_4\) is assigned to a new session. The middle row shows data randomization strategy that shuffles time intervals between actions while preserving their order. This tends to change the definition of sessions. The bottom row shows the second randomization strategy, which shuffles the order of actions within sessions, while preserving the time intervals between actions

Testing data for Simpson’s paradox

When can a cautious researcher accept results of analysis? I describe a simple test that can help ascertain whether a pattern observed in data is robust or potentially a manifestation of Simpson’s paradox. The test creates a randomized version of the data by shuffling it with respect to the attribute for which the trend is measured. Shuffling preserves the distribution of features, but destroys correlation between the outcome variable and that attribute. As a result, any trends with respect to that attribute should disappear. This suggests a rule of thumb: if, after shuffling, the trend persists in the aggregate data but disappears when the shuffled data are disaggregated, then Simpson’s paradox may be present.
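A minimal sketch of this rule of thumb is given below. It is one possible reading of the test, not the exact procedure used in the cited studies: outcomes are permuted within each unit (for example, one user or one session), the trend is summarized by a least-squares slope, and the function names, the strata, and the synthetic data are all assumptions.

```python
import random


def slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


def shuffle_outcomes(unit, rng):
    """Permute the outcome values of one unit while keeping its x values fixed."""
    xs = [x for x, _ in unit]
    ys = [y for _, y in unit]
    rng.shuffle(ys)
    return list(zip(xs, ys))


def shuffle_test(strata, seed=0):
    """strata maps a stratum label to a list of units (lists of (x, y) pairs)."""
    rng = random.Random(seed)
    shuffled = {
        label: [shuffle_outcomes(unit, rng) for unit in units]
        for label, units in strata.items()
    }

    def pooled(data):
        return [pt for units in data.values() for unit in units for pt in unit]

    def by_stratum(data):
        return {
            label: slope([pt for unit in units for pt in unit])
            for label, units in data.items()
        }

    return {
        "aggregate_original": slope(pooled(strata)),
        "aggregate_shuffled": slope(pooled(shuffled)),
        "stratum_original": by_stratum(strata),
        "stratum_shuffled": by_stratum(shuffled),
    }


# Synthetic data: 300 units per stratum; a unit of length n has y = 10 * n - x.
strata = {
    n: [[(x, 10 * n - x) for x in range(n)] for _ in range(300)]
    for n in range(2, 6)
}
result = shuffle_test(strata)
for name, value in result.items():
    print(name, value)
```

In this synthetic example the slope within every stratum is exactly -1 in the original data and close to zero after shuffling, while the aggregate slope is positive both before and after shuffling. A trend that survives shuffling in the aggregate but vanishes within strata is the signature the rule of thumb looks for.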

In the analyses described above, the independent variable was time, or a proxy of it, such as the point within a session when the action takes place. There are at least two different randomization strategies with respect to time. The first strategy creates randomized session data by preserving the temporal order of actions, but shuffling the time intervals between them, as shown in Fig. 4 (middle row). Since a session break is defined as a sufficiently long time interval between actions, shuffling time intervals will merge shorter sessions and break up longer ones, while preserving the sequence of actions. The second strategy creates randomized index data by shuffling the order of actions within a session, e.g., exchanging \(C_1\) with \(C_3\) in Fig. 4 (bottom row).
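Both randomization strategies can be sketched in a few lines. The code below is an illustration under assumptions: timestamps are minutes for a single user, the 60-minute break follows the example in the caption of Fig. 4, actions are identified by their position in the original stream, and the function names are hypothetical.

```python
import random


def sessionize(timestamps, max_gap=60):
    """Group action indices into sessions, splitting at gaps longer than max_gap."""
    sessions = []
    for i, t in enumerate(timestamps):
        if i == 0 or t - timestamps[i - 1] > max_gap:
            sessions.append([])  # a long break starts a new session
        sessions[-1].append(i)
    return sessions


def shuffle_intervals(timestamps, rng):
    """Strategy 1: keep the order of actions, permute the gaps between them."""
    if not timestamps:
        return []
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    rng.shuffle(gaps)
    shuffled = [timestamps[0]]
    for gap in gaps:
        shuffled.append(shuffled[-1] + gap)
    return shuffled


def shuffle_within_sessions(sessions, rng):
    """Strategy 2: keep session boundaries, permute the order of actions inside."""
    reordered = []
    for session in sessions:
        permuted = session[:]
        rng.shuffle(permuted)
        reordered.append(permuted)
    return reordered


rng = random.Random(0)
timestamps = [0, 5, 12, 200, 204, 210, 215, 400]  # hypothetical action times (min)
sessions = sessionize(timestamps)                 # [[0, 1, 2], [3, 4, 5, 6], [7]]
new_times = shuffle_intervals(timestamps, rng)
new_sessions = sessionize(new_times)              # boundaries usually move
reordered = shuffle_within_sessions(sessions, rng)
print(sessions, new_sessions, reordered)
```

The first strategy leaves the sequence of actions and the set of gaps untouched but typically changes which actions fall into the same session, while the second keeps each session’s membership and only changes the position of each action within it.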

Below I illustrate the shuffle test with real-world examples. I show that when the data are shuffled, the trend still persists in the aggregate data, but disappears, as expected, when the shuffled data are disaggregated.
Fig. 5

Online shopping. Relationship between purchase price and time to next purchase in data (red line) and in the shuffled data (blue line), in which the purchase prices of items were randomly shuffled. The positive trend seen in the aggregate data (a) still persists when data are shuffled. However, when data are disaggregated by the number of purchases, specifically, users who made exactly five purchases (b), the trend disappears in the shuffled data.

Online shopping. A study of online shopping examined whether individual purchasing decisions are constrained by finances. The study looked at the relationship between the purchase price of an item and the time interval since the last purchase [13]. Budgetary constraints would force a user to wait after making a purchase to accumulate enough money for another purchase. Figure 5a reports (normalized) purchase price of an item as a function of the time since last purchase (red line). The longer the delay, the larger the fraction of the budget users spend on a purchase, which appears to support the hypothesis.

To test the robustness of this finding, the data were shuffled by randomly swapping the prices of products purchased by users, which destroys the correlation between the time between purchases and purchase price. Surprisingly, the trend remains (blue line). This is due to heterogeneity of the underlying population: the population represents a mix of users with different purchasing habits. The frequent buyers purchase cheaper items more frequently, and they are systematically overrepresented on the left side of the plot, even in shuffled data.

To stratify data, buyers were grouped by the number of purchases they make, for example, those making exactly five purchases (Fig. 5b). The positive trend between the normalized purchase price and time seen in the disaggregated data (red line) disappears in the shuffled data (blue line), giving unbiased support for the limited budget hypothesis.
Fig. 6

Answer’s acceptance probability as a function of its session index in the randomized Stack Exchange data. The left panel shows that the upward trend seen in Fig. 3 is preserved in the aggregate shuffled data. However, when shuffled data are disaggregated by session length (b), the trends largely disappear

Stack Exchange. To test robustness of trends shown in Fig. 3, which reports how acceptance probability of an answer posted on Stack Exchange changes over the course of a session, we randomize data by shuffling the time intervals between answers posted by each user, while preserving other features, including the temporal order of answers. The randomization procedure changes sessions by breaking up longer sessions and concatenating shorter ones. By changing which sequence of answers is considered to belong to a session, we expect randomization to change the observed trends in acceptance probability.

The upward trend in acceptance probability seen in aggregate data still exists in the randomized data (Fig. 6a), even though the trends in randomized data disappear, as expected, when data are disaggregated by session length (Fig. 6b). This confirms the need for stratifying data by session length in analysis.
Fig. 7

Deterioration in comment quality on Reddit. When data are disaggregated by length of the session (different color lines), the quantitative proxies of comment quality decline over the course of a session. The x-axis represents index of the comment within a session, and the y-axis gives the average value of the proxy measure (with error bars). The declines observed in original Reddit data (top row) mostly disappear when data are randomized (bottom row)

Reddit comments. A similar quality deterioration effect was observed for comments posted on Reddit. Regardless of what measure is used as a proxy of quality (comment length, the number of responses or upvotes it receives from others, or its textual complexity), the quality of each successive comment written by a Reddit user decreases over the course of a session [21]. To test the robustness of this finding, Singer et al. randomized Reddit activity data. Figure 7 compares the trends for the proxies of comment quality in the original data to those in the randomized data. Both data sets have been disaggregated by session length. The decreasing trends observed in the original Reddit data (top row) largely disappear in the randomized data (bottom row). Where the trends still exist, the deterioration effect is much reduced. This suggests that most of the data heterogeneity is captured by session length.


Simpson’s paradox can indicate that interesting patterns exist in data [6], but it can also skew analysis. The paradox suggests that data come from subgroups that differ systematically in their behavior, and that these differences are large enough to affect analysis of aggregate data. In this case, the trends discovered in disaggregated data are more likely to describe—and predict—individual behavior than the trends found in aggregate data. Thus, to build more robust models of behavior, computational social scientists need to identify confounding variables which could affect observed trends. The shuffle test described in this paper provides a framework for determining whether Simpson’s paradox is affecting conclusions.



Many people have contributed along the way to identifying the problem of Simpson’s paradox in data analysis, investigating it empirically, and devising methods to mitigate its effects. These people include Nathan Hodas, Farshad Kooti, Keith Burghardt, Philipp Singer, Emilio Ferrara, Peter Fennell, and Nazanin Alipourfard. This work was funded, in part, by the Army Research Office under contract W911NF-15-1-0142.


  1. Alipourfard, N., Fennell, P., & Lerman, K. (2018). Can you trust the trend? Discovering Simpson’s paradoxes in social data. In Proceedings of the 11th International ACM Conference on Web Search and Data Mining. ACM.
  2. Barbosa, S., Cosley, D., Sharma, A., & Cesar, R. M., Jr. (2016). Averaging gone wrong: Using time-aware analyses to better understand behavior. In Proceedings of the World Wide Web Conference (pp. 829–841), April 2016.
  3. Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175), 398–404.
  4. Blyth, C. R. (1972). On Simpson’s paradox and the sure-thing principle. Journal of the American Statistical Association, 67(338), 364–366.
  5. Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D., Marlow, C., Settle, J. E., et al. (2012). A 61-million-person experiment in social influence and political mobilization. Nature, 489(7415), 295–298.
  6. Fabris, C., & Freitas, A. (2000). Discovering surprising patterns by detecting occurrences of Simpson’s paradox. In M. Bramer, A. Macintosh, & F. Coenen (Eds.), Research and Development in Intelligent Systems XVI (pp. 148–160). London: Springer.
  7. Ferrara, E., Alipourfard, N., Burghardt, K., Gopal, C., & Lerman, K. (2017). Dynamics of content quality in collaborative knowledge production. In Proceedings of 11th AAAI International Conference on Web and Social Media. AAAI.
  8. Golder, S. A., & Macy, M. W. (2011). Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science, 333(6051), 1878–1881.
  9. Hodas, N. O., & Lerman, K. (2012). How limited visibility and divided attention constrain social contagion. In ASE/IEEE International Conference on Social Computing.
  10. Hodas, N. O., & Lerman, K. (2014). The simple rules of social contagion. Scientific Reports, 4, 4343.
  11. Hogg, T., & Lerman, K. (2012). Social dynamics of Digg. EPJ Data Science, 1(1), 5.
  12. Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., & Mullainathan, S. (2017). Human decisions and machine predictions. National Bureau of Economic Research: Technical report.
  13. Kooti, F., Lerman, K., Aiello, L. M., Grbovic, M., Djuric, N., & Radosavljevic, V. (2016). Portrait of an online shopper: Understanding and predicting consumer behavior. In The 9th ACM International Conference on Web Search and Data Mining.
  14. Kooti, F., Subbian, K., Mason, W., Adamic, L., & Lerman, K. (2017). Understanding short-term changes in online activity sessions. In Proceedings of the 26th International World Wide Web Conference (Companion WWW2017).
  15. Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A.-L., Brewer, D., et al. (2009). Computational social science. Science, 323, 721–723.
  16. Lerman, K. (2016). Information is not a virus, and other consequences of human cognitive limits. Future Internet, 8(2), 21.
  17. McFarland, D. A., Lewis, K., & Goldberg, A. (2016). Sociology in the era of big data: The ascent of forensic social science. The American Sociologist, 47(1), 12–35.
  18. Norton, J. H., & Divine, G. (2015). Simpson’s paradox... and how to avoid it. Significance, 12(4), 40–43.
  19. Rodriguez, M. G., Gummadi, K., & Schoelkopf, B. (2014). Quantifying information overload in social media and its impact on social contagions. In Proceedings of Eighth International AAAI Conference on Weblogs and Social Media.
  20. Romero, D. M., Meeder, B., & Kleinberg, J. (2011). Differences in the mechanics of information diffusion across topics: Idioms, political hashtags, and complex contagion on Twitter. In Proceedings of the 20th International Conference on World Wide Web (pp. 695–704), New York, NY, USA: ACM.
  21. Singer, P., Ferrara, E., Kooti, F., Strohmaier, M., & Lerman, K. (2016). Evidence of online performance deterioration in user sessions on Reddit. PLoS ONE, 11(8), e0161636.
  22. Ver Steeg, G., Ghosh, R., & Lerman, K. (2011). What stops social epidemics? In Proceedings of 5th International Conference on Weblogs and Social Media. AAAI.
  23. Vaupel, J. W., & Yashin, A. I. (1985). Heterogeneity’s ruses: Some surprising effects of selection on population dynamics. The American Statistician, 39(3), 176–185.

Copyright information

© Springer Nature Singapore Pte Ltd. 2017

Authors and Affiliations

  1. USC Information Sciences Institute, Marina del Rey, USA
