Abstract
The rise of big data—data that are not only large and massively multivariate but concern a dizzying array of phenomena—represents a watershed moment for the social sciences. These data have created demand for new methods that reduce/simplify the dimensionality of data, identify novel patterns and relations, and predict outcomes, from computational ethnography and computational linguistics to network science, machine learning, and in situ experiments. Such developments have led scholars to begin new lines of social inquiry. Company engineers, computer scientists, and social scientists have all converged on big data, creating the possibility of a vibrant “trading zone” for collaboration. However, strong differences in research frameworks help explain why big data may not be an egalitarian trading zone across fields, but rather—at least in the short term—a moment when engineering colonizes sociology more than vice versa. In the long term, however, we suggest there may be the possibility of a constructive synthesis across paradigms in what we term ‘forensic social science.’
Notes
This approach also had an elective affinity with certain social scientific theories over others: for instance, rational choice theory (Coleman 1994a) was arguably more readily translated into the data and methods of the time than, say, the more abstract theories that preceded it, such as structural functionalism.
However, we are not necessarily living in an age of science. By this we mean that we are black-boxing information and knowledge in tools and treatments that bring about desired outcomes without ever really understanding why or how they do so. While in the past scientific facts were black-boxed due to their complexity (Latour 1988), we now find ourselves in a time when we often seek solutions without any concern or desire for explanation at all. Not everyone wants this; rather, the prevailing pressures of industry, engineering, and the practical concerns of life in a technology-mediated age demand it. Scientists will still seek explanations, but they may remain an (increasingly small) numerical minority.
This feature of big data, in particular, is a curse as much as a blessing. Several years ago, Lazer et al. questioned whether “computational social science could become the exclusive domain of private companies and government agencies” (2009:721). Equally problematic as questions of ownership and access, however, are related concerns about data quality and interpretation: in eliminating the participation of the academic researcher, we also eliminate the guiding force that orients data collection towards the pursuit of knowledge rather than the maximization of profit. The data that are collected in industry are not always the data that are most useful for science (as we elaborate in great detail below); worse, too seldom acknowledged is the basic observation that technologies constrain as much as they enable—and so any given dataset may tell us less about human agency and more about interfaces and algorithms that subtly influence user behavior (cf. Lewis 2015).
An additional dimension to these new types of data relates to behaviors that are made possible by digital intermediation and hitherto did not exist. For example, in pre-internet times, people were simply technically unable to share photos on the scale and frequency they do today. These types of technologically enabled social transactions are a specific category of behaviors, some of which may (or may not) significantly affect social dynamics and structures. Though the same technological advances that make big data possible enable these new categories of data, these advances are, in principle, no different from previous technological and ideational transformations that catalyzed social change (such as the invention of the printing press or the emergence of the formal organization). In that respect, data generated on digitally-mediated platforms that represent new categories of social action are no different from other phenomena of sociological interest.
For our part, we believe the linking of multiple corpora for an entire domain will bring the greatest advances to the social sciences. With rich, multifaceted data for an entire social system (of, say, politics, a market, or academe) we can ask and answer a variety of social science questions with less concern about confounding, missing data, and selection bias (see Coleman 1994b).
A note of caution is in order here: by “no concern for theory,” we are referring only to theories that relate to explaining the social phenomenon in question. There are, of course, multiple statistical assumptions, informed by theory, that are embodied in the data-mining algorithms being employed.
“Novel” often amounts to “more comprehensive” as system-wide, societal-wide, and even planetary-wide data are increasingly available (e.g. Leskovec and Horvitz 2008).
Naturally, conclusions from big data will always require qualification insofar as 1) the “digital divide” persists and 2) usage patterns of a given technology are differentiated even among those who can access it (see Lewis in press). Nonetheless, digital—and especially mobile—communications technologies are diffusing at a staggering rate (e.g. Castells et al. 2007); many population-level datasets are compiled by government or other record-keeping organizations and inclusion is not biased by “self-selection”; and given our unprecedented reliance on technology for communication, information retrieval, and relationship formation and maintenance (Bohn et al. 2014; Rosenfeld and Thomas 2012; Sparrow et al. 2011), the sheer size of available data and the proportion of humanity to whom it pertains is staggering.
The term “paradigm” may be too strong for the social science disciplines, as they often lack a shared set of standards and questions. In fact, Thomas Kuhn regarded them as pre-paradigmatic (1996). That said, we maintain it is still reasonable to regard social science and engineering as entailing different research frameworks or distinct gestalts of epistemology and research activity.
Distinct kinds of training and opportunities for employment are also important, but brevity requires this to remain a caricature of the two perspectives. We merely intend for the reader to grasp these differences on an intuitive level, so she can see incommensurability as one reason for less reciprocal forms of exchange in the trading zone of big data.
Here is an example in which the study of online dating sites by engineers reveals which odd questions differentiate tastes (http://blog.okcupid.com/index.php/the-best-questions-for-first-dates/).
References
Abbott, A. (1988). Transcending general linear reality. Sociological Theory, 6(2), 169–86.
Agresti, A., & Finlay, B. (2009). Statistical methods for the social sciences (4th ed.). Upper Saddle River: Prentice Hall.
Alpaydin, E. (2004). Introduction to machine learning. Cambridge: MIT Press.
Anand, G. (2010). A weird way of thinking has prevailed worldwide. New York Times (August 25, 2010).
Anderson, M. J. (1988). The American census: a social history. New Haven: Yale University Press.
Anderson, A., McFarland, D. A., & Jurafsky, D. (2012). Towards a computational history of the ACL: 1980–2008. Association of Computational Linguistics, Workshop (ACL Workshop 2012).
Backstrom, L., Kleinberg, J., Lee, L., & Danescu-Niculescu-Mizil, C. (2013). Characterizing and curating conversation threads: expansion, focus, volume, re-entry. Proceedings of WSDM, 2013.
Bail, C. A. (2014). The cultural environment: measuring culture with big data. Theory and Society, 43, 465–482.
Barabasi, A. (2003). Linked: How everything is connected to everything else and what it means for business, science, and everyday life. New York: Plume.
Bender-deMoll, S., & McFarland, D. A. (2006). The art and science of dynamic network visualization. Journal of Social Structure, 7(2).
Berger, P., & Luckmann, T. (1966). The social construction of reality: a treatise in the sociology of knowledge. New York: Anchor.
Bishop, C. (2007). Pattern recognition and machine learning (information science and statistics). Cambridge: Springer.
Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
Bohn, A., Buchta, C., Hornik, K., & Mair, P. (2014). Making friends and communicating on facebook: implications for the access to social capital. Social Networks, 37, 29–41.
Borgatti, S. P., Mehra, A., Brass, D. J., & Labianca, G. (2009). Network analysis in the social sciences. Science, 323, 892–95.
Boyd, D., & Crawford, K. (2012). Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15, 662–79.
Brandes, U., Robins, G., McCranie, A., & Wasserman, S. (2013). What is network science? Network Science, 1, 1–15.
Brown, J. S., & Duguid, P. (2002). The social life of information. Boston: Harvard Business Review Press.
Bruch, E. E., & Mare, R. D. (2012). Methodological issues in the analysis of residential preferences and residential mobility. Sociological Methodology, 42, 103–54.
Camic, C., & Xie, Y. (1994). The statistical turn in American social science: Columbia University, 1890 to 1915. American Sociological Review, 59(5), 773–805.
Castells, M., Fernández-Ardèvol, M., Qiu, J. L., & Sey, A. (2007). Mobile communication and society: a global perspective. Cambridge: MIT Press.
Centola, D. (2010). The spread of behavior in an online social network experiment. Science, 329(5996), 1194–97.
Coleman, J. S. (1986). Social theory, social research, and a theory of action. American Journal of Sociology, 91(6), 1309–35.
Coleman, J. S. (1994a). Foundations of social theory. Cambridge: Belknap Press.
Coleman, J. S. (1994b). A vision for sociology. Society, 30, 29–34.
Collins, H., Evans, R., & Gorman, M. (2007). Trading zones and interactional expertise. Studies in History and Philosophy of Science, 38(4), 657–66.
Converse, J. M. (1987). Survey research in the United States: roots and emergence 1890–1960. Berkeley: University of California Press.
Cukier, K., & Mayer-Schoenberger, V. (2013). The rise of big data: how it’s changing the way we think about the world. Foreign Affairs, 28–41.
Diehl, D., & McFarland, D. A. (2010). Towards a historical sociology of situations. American Journal of Sociology, 115(6), 1713–52.
Dodds, P. S., Muhamad, R., & Watts, D. (2003). An experimental study of search in global social networks. Science, 301(5634), 827–9.
Easley, D., & Kleinberg, J. (2010). Networks, crowds, and markets: reasoning about a highly connected world. Cambridge: Cambridge University Press.
Einav, L., Levin, J., Popov, I., & Sundaresan, N. (2014). Growth, adoption and use of mobile e-commerce. American Economic Review: Papers and Proceedings, 104(5), 489–94.
Fleck, L. (1979). Genesis and development of a scientific fact. Chicago: University of Chicago Press.
Galison, P. (1997). Image and logic: a material culture of microphysics. Chicago: University of Chicago Press.
Glaser, B. G., & Strauss, A. L. (1967). The discovery of grounded theory: strategies for qualitative research. Chicago: Aldine Pub. Co.
Goldberg, A. (in press). In defense of forensic social science. Big Data & Society.
Golder, S. A., & Macy, M. W. (2011). Diurnal and seasonal mood vary with work, sleep and daylength across diverse cultures. Science, 333(6051), 1878–81.
Golder, S. A., & Macy, M. W. (2014). Digital footprints: opportunities and challenges for online social research. Annual Review of Sociology, 40, 129–52.
González-Bailón, S., Borge-Holthoefer, J., Rivero, A., & Moreno, Y. (2011). The dynamics of protest recruitment through an online network. Scientific Reports, 1, 197.
González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J., & Moreno, Y. (2014). Assessing the bias in samples of large online networks. Social Networks, 38, 16–27.
Grimmer, J., Westwood, S. J., & Messing, S. (2014). The impression of influence: legislator communication, representation, and democratic accountability. Princeton: Princeton University Press.
Hacking, I. (2006). The emergence of probability. Cambridge: Cambridge University Press.
Hilbert, M., & López, P. (2011). The world’s technological capacity to store, communicate, and compute information. Science, 332(6025), 60–5.
Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. New York: Prentice Hall.
Kagan, J. (2009). The three cultures: natural sciences, social sciences, and the humanities in the 21st century. New York: Cambridge University Press.
Kirchner, C., & Mohr, J. W. (2010). Meanings and relations: an introduction to the study of language, discourse, and networks. Poetics, 38(6), 555–66.
Kohavi, R., & Longbotham, R. (2007). Online experiments: lessons learned. Computer, 40(9), 103–5.
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30–37.
Kuhn, T. S. (1996). The structure of scientific revolutions (3rd ed.). Chicago: University of Chicago Press.
Latour, B. (1988). Science in action. Cambridge: Harvard University Press.
Latour, B., & Woolgar, S. (1986). Laboratory life: the construction of scientific facts (2nd ed.). Princeton: Princeton University Press.
Laumann, E. O., Marsden, P., & Prensky, D. (1983). The boundary specification problem in network analysis. In R. S. Burt & M. J. Minor (Eds.), Applied network analysis: A methodological introduction. London: Sage Publications.
Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D., & Alstyne, M. V. (2009). Computational social science. Science, 323(5915), 721–3.
Leskovec, J., & Horvitz, E. (2008). Planetary-scale views on a large instant-messaging network. International World Wide Web Conference (WWW).
Leskovec, J., Lang, K., & Mahoney, M. (2010). Empirical comparison of algorithms for network community detection. In WWW ’10: Proceedings of the 19th International Conference on World Wide Web. New York: ACM.
Levine, D. N. (1995). Visions of the sociological tradition. Chicago: University of Chicago Press.
Lewis, K. (2015). Studying online behavior: comment on Anderson et al. 2014. Sociological Science, 2, 20–31.
Lewis, K. (in press). Three fallacies of digital footprints. Big Data & Society.
Lewis, K., Kaufman, J., Gonzalez, M., Wimmer, A., & Christakis, N. (2008). Tastes, ties, and time: a new social network dataset using facebook.com. Social Networks, 30(4), 330–42.
Lohr, S. (2012). The age of big data. New York Times (February 11, 2012).
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge: MIT Press.
McCallum, A., Corrada-Emmanuel, A., & Wang, X. (2005). Topic and role discovery in social networks. IJCAI (International Joint Conferences on Artificial Intelligence).
McFarland, D. A., & McFarland, H. R. (in press). Big data and the danger of being precisely inaccurate. Big Data & Society.
McFarland, D. A., Diehl, D., & Rawlings, C. (2011). Methodological transactionalism and the sociology of education. In M. T. Hallinan (Ed.), Frontiers in sociology of education (pp. 87–109). New York: Springer.
McFarland, D. A., Manning, C. D., Ramage, D., Chuang, J., Heer, J., & Jurafsky, D. (2013a). Differentiating language usage through topic models. Poetics, 41(6), 607–25.
McFarland, D. A., Jurafsky, D., & Rawlings, C. (2013b). Making the connection: social bonding in courtship situations. American Journal of Sociology, 118(6), 1596–1649.
Menand, L. (2010). The marketplace of ideas: issues of our time. New York: W.W. Norton & Company.
National Research Council. (2014). Convergence: Facilitating transdisciplinary integration of life sciences, physical sciences, engineering and beyond. National Research Council.
Newman, M. E. J. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98, 404–409.
Newman, M. E. J. (2009). Networks: an introduction. Oxford: Oxford University Press.
Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69, 026113.
Pentland, A. (2014). Social physics: How good ideas spread--the lessons from a new science. New York: Penguin Press.
Platt, J. (1996). A history of sociological research methods in America, 1920–1960. Cambridge: Cambridge University Press.
Porter, T. M. (1995). Trust in numbers. Princeton: Princeton University Press.
Porter, T. M., & Ross, D. (Eds.). (2003). The modern social sciences. New York: Cambridge University Press.
Ranganath, R., Jurafsky, D., & McFarland, D. A. (2012). Detecting friendly, flirtatious, awkward, and assertive speech in speed-dates. Computer Speech and Language, 27(1), 89–115.
Rogers, E. M. (1987). Progress, problems and prospects for network research: investigating relationships in the age of electronic communication technologies. Social Networks, 9, 285–310.
Rosenfeld, M. J., & Thomas, R. J. (2012). Searching for a mate: the rise of the internet as a social intermediary. American Sociological Review, 77(4), 523–47.
Salganik, M. J., Dodds, P. S., & Watts, D. J. (2006). Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311, 854–6.
Shi, X., Leskovec, J., & McFarland, D. A. (2010). Citing for high impact. Joint Conference on Digital Libraries, (JCDL 2010).
Shwed, U., & Bearman, P. S. (2010). The temporal structure of scientific consensus formation. American Sociological Review, 75(6), 817–40.
Smith, A., & Duggan, M. (2013). Online dating & relationships. Washington: Pew Research Center.
Snow, C. P. (2001). The two cultures. London: Cambridge University Press. (Original work published 1959).
Sparrow, B., Liu, J., & Wegner, D. M. (2011). Google effects on memory: cognitive consequences of having information at our fingertips. Science, 333, 776–8.
Stokes, D. E. (1997). Pasteur’s quadrant: basic science and technological innovation. Washington: Brookings Institution Press.
Stouffer, S. A. (1949). The American soldier (Studies in social psychology during World War II, 4 vols.). Princeton, NJ: Princeton University Press.
Szell, M., & Thurner, S. (2010). Measuring social dynamics in a massive multiplayer online game. Social Networks, 32, 313–29.
Talley, E., Newman, D., Herr, B., II, Wallach, H., Burns, G., Leenders, M., & McCallum, A. (2011). A database of National Institutes of Health (NIH) research using machine learned categories and graphically clustered grant awards. Nature Methods, 8, 443–4.
Vaisey, S. (2009). Motivation and justification: a dual-process model of culture in action. American Journal of Sociology, 114, 1675–1715.
Vaughan, D. (2014). Analogy, cases, and comparative social organization. In R. Swedberg (Ed.), Theorizing in social science: the context of discovery (pp. 61–84). Stanford: Stanford University Press.
Wang, D. J., Shi, X., McFarland, D. A., & Leskovec, J. (2012). Measurement error in social network data: a re-classification. Social Networks, 34(4), 396–409.
Wasserman, S., & Faust, K. (1994). Social network analysis: methods and applications. Cambridge: Cambridge University Press.
Additional information
This paper has benefited from conversations with Jure Leskovec and Kylie Swall, who helped us formulate some of this material. It also benefited from the feedback of Ron Breiger and John Mohr at the annual conference of the American Sociological Association in 2014. The work presented here is supported in part by Stanford’s Center for Computational Social Science and NSF # 0835614.
About this article
Cite this article
McFarland, D.A., Lewis, K. & Goldberg, A. Sociology in the Era of Big Data: The Ascent of Forensic Social Science. Am Soc 47, 12–35 (2016). https://doi.org/10.1007/s12108-015-9291-8
DOI: https://doi.org/10.1007/s12108-015-9291-8