Observational data about human behavior are often heterogeneous, i.e., generated by subgroups within the population under study that vary in size and behavior. Heterogeneity predisposes analysis to Simpson’s paradox, whereby the trends observed in data that have been aggregated over the entire population may be substantially different from those of the underlying subgroups. I illustrate Simpson’s paradox with several examples from studies of online behavior and show that the aggregate response leads to wrong conclusions about the underlying individual behavior. I then present a simple method to test whether Simpson’s paradox is affecting the results of an analysis. The presence of Simpson’s paradox in social data suggests that important behavioral differences exist within the population, and that failure to take these differences into account can distort a study’s findings.
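The reversal described above can be made concrete with a small numerical sketch. The data below are hypothetical and chosen only for illustration: three subgroups each show a decreasing trend, yet pooling them yields an increasing one. The helper names `slope` and `trend_reverses` are assumptions introduced here, and the check is a generic sign comparison between the pooled and per-subgroup least-squares slopes, not necessarily the test presented in the article.

```python
from statistics import mean


def slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in points)
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def trend_reverses(groups):
    """Return True if the pooled slope has the opposite sign to every subgroup slope.

    `groups` maps a subgroup label to its list of (x, y) observations.
    This is an illustrative check, assumed for this sketch only.
    """
    pooled = [p for pts in groups.values() for p in pts]
    pooled_slope = slope(pooled)
    group_slopes = [slope(pts) for pts in groups.values()]
    return all(s * pooled_slope < 0 for s in group_slopes)


# Hypothetical data: within each subgroup y falls as x rises,
# but subgroups with larger x also have a higher baseline y.
groups = {
    "A": [(1, 2.0), (2, 1.5), (3, 1.0)],
    "B": [(4, 5.0), (5, 4.5), (6, 4.0)],
    "C": [(7, 8.0), (8, 7.5), (9, 7.0)],
}

pooled_slope = slope([p for pts in groups.values() for p in pts])
group_slopes = {name: slope(pts) for name, pts in groups.items()}

print("pooled slope:", pooled_slope)
print("subgroup slopes:", group_slopes)
print("trend reverses:", trend_reverses(groups))
```

In this toy setting the pooled slope is positive while every subgroup slope is negative, so a conclusion drawn from the aggregate would point in the wrong direction for each subgroup. With real behavioral data the subgroups are usually not given in advance, so the choice of the variable used to disaggregate is itself part of the analysis.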
Many people have contributed along the way to identifying the problem of Simpson’s paradox in data analysis, investigating it empirically, and devising methods to mitigate its effects. They include Nathan Hodas, Farshad Kooti, Keith Burghardt, Philipp Singer, Emilio Ferrara, Peter Fennell, and Nazanin Alipourfard. This work was funded, in part, by the Army Research Office under contract W911NF-15-1-0142.
Cite this article
Lerman, K. Computational social scientist beware: Simpson’s paradox in behavioral data. J Comput Soc Sc 1, 49–58 (2018). https://doi.org/10.1007/s42001-017-0007-4
- Simpson’s paradox
- Survivorship bias
- Ecological fallacy