Computational social scientist beware: Simpson’s paradox in behavioral data

Abstract

Observational data about human behavior are often heterogeneous, i.e., generated by subgroups within the population under study that vary in size and behavior. Heterogeneity predisposes analysis to Simpson’s paradox, whereby the trends observed in data that have been aggregated over the entire population may be substantially different from those of the underlying subgroups. I illustrate Simpson’s paradox with several examples from studies of online behavior and show that the aggregate response can lead to wrong conclusions about the underlying individual behavior. I then present a simple method to test whether Simpson’s paradox is affecting the results of an analysis. The presence of Simpson’s paradox in social data suggests that important behavioral differences exist within the population, and that failure to take these differences into account can distort a study’s findings.
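
The reversal described in the abstract can be made concrete with a toy calculation. The sketch below is an illustration only, not the test the article presents: the two subgroups, their data points, and the helper names (`ols_slope`, `slopes`, `trend_reverses`) are hypothetical and chosen so that the outcome falls with x inside each subgroup while the pooled data show a rising trend.

```python
from statistics import mean


def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical toy data: (x, y) observations for two subgroups.
# Subgroup B has larger x values and a higher baseline y than subgroup A.
groups = {
    "A": [(1, 5), (2, 4), (3, 3)],
    "B": [(6, 10), (7, 9), (8, 8)],
}


def slopes(groups):
    """Return (aggregate slope, {subgroup name: slope})."""
    per_group = {name: ols_slope(*zip(*pts)) for name, pts in groups.items()}
    pooled = [p for pts in groups.values() for p in pts]
    aggregate = ols_slope(*zip(*pooled))
    return aggregate, per_group


def trend_reverses(groups):
    """True if every subgroup slope has the opposite sign to the pooled slope."""
    aggregate, per_group = slopes(groups)
    return all(s * aggregate < 0 for s in per_group.values())


aggregate, per_group = slopes(groups)
print("aggregate slope:", aggregate)
print("subgroup slopes:", per_group)
print("trend reverses on disaggregation:", trend_reverses(groups))
```

In this toy data the pooled slope is positive only because subgroup B sits at both higher x and higher y. Comparing the sign of the pooled trend with the signs of the subgroup trends is one simple way to spot that kind of heterogeneity, assuming the relevant subgroups are known in advance.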

Acknowledgements

Many people have contributed along the way to identifying the problem of Simpson’s paradox in data analysis, investigating it empirically, and devising methods to mitigate its effects. These people include Nathan Hodas, Farshad Kooti, Keith Burghardt, Philipp Singer, Emilio Ferrara, Peter Fennell, and Nazanin Alipourfard. This work was funded, in part, by the Army Research Office under contract W911NF-15-1-0142.

Author information

Corresponding author

Correspondence to Kristina Lerman.

About this article

Cite this article

Lerman, K. Computational social scientist beware: Simpson’s paradox in behavioral data. J Comput Soc Sc 1, 49–58 (2018). https://doi.org/10.1007/s42001-017-0007-4

Keywords

  • Statistics
  • Simpson’s paradox
  • Survivorship bias
  • Ecological fallacy
  • Heterogeneity