Web scraping, defined as the automated extraction of information online, is an increasingly important means of producing data in the social sciences. We contribute to the emerging social science literature on computational methods by elaborating on web scraping as a means of automated access to information. We begin by situating the practice of web scraping in context, providing an overview of how it works and how it compares to other methods in the social sciences. Next, we assess the benefits and challenges of scraping as a technique of information production. In terms of benefits, we highlight how scraping can help researchers answer new questions, move beyond the limits of official data, overcome access hurdles, and reinvigorate the values of sharing, openness, and trust in the social sciences. In terms of challenges, we discuss three: technical, legal, and ethical. By adopting “algorithmic thinking in the public interest” as a way of navigating these hurdles, researchers can improve the state of access to information on the Internet while also contributing to scholarly discussions about the legality and ethics of web scraping. Example software accompanying this article is available in the supplementary materials.
Other common names for data scraping include web scraping, screen scraping, web data extraction, web harvesting, and data harvesting. There are technical differences between data “scraping” and website “crawling”. A crawler is a bot that navigates to a website to index it (i.e., to record keywords and metadata) and then navigates to other websites via the links on that page. A scraper is a bot designed with the explicit intent of navigating to one or more target websites and extracting specific information from them. For the sake of simplicity, we conflate the two concepts here. Where differences exist, the two are contrasted in-text.
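The distinction between a crawler and a scraper can be illustrated with a minimal sketch in Python. This is not the software from the article's supplementary materials; the `PAGES` dictionary, the URLs, the `price` class name, and the function names are all hypothetical, and the in-memory pages stand in for real HTTP requests so that the example runs offline.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "web" (URL -> HTML); a stand-in for real HTTP requests.
PAGES = {
    "/home": '<title>Home</title><a href="/listings">Listings</a>'
             '<a href="/about">About</a>',
    "/listings": '<title>Listings</title><span class="price">1200</span>'
                 '<span class="price">950</span><a href="/home">Home</a>',
    "/about": '<title>About</title><p>Contact us</p>',
}


class PageParser(HTMLParser):
    """Collects a page's title, its outgoing links, and any 'price' values."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.prices = []
        self._capture = None  # which kind of text we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "title":
            self._capture = "title"
        elif attrs.get("class") == "price":
            self._capture = "price"

    def handle_data(self, data):
        if self._capture == "title":
            self.title = data.strip()
        elif self._capture == "price":
            self.prices.append(int(data))

    def handle_endtag(self, tag):
        self._capture = None


def parse(url):
    parser = PageParser()
    parser.feed(PAGES[url])
    return parser


def crawl(start):
    """Crawler: index each page's title, then follow its links to new pages."""
    index, queue, seen = {}, [start], {start}
    while queue:
        url = queue.pop(0)
        page = parse(url)
        index[url] = page.title
        for link in page.links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def scrape(urls):
    """Scraper: visit only the given target pages and extract price values."""
    return {url: parse(url).prices for url in urls}
```

Here `crawl("/home")` discovers and indexes every page reachable from the starting page, whereas `scrape(["/listings"])` visits only a named target and returns the specific values of interest. A real project would also need to handle network errors, rate limits, and a site's terms of use, which this sketch leaves out.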
Within the Supplementary Materials, we exemplify the combined use of several scraping libraries to achieve increasingly complex automated data extraction.
We thank the anonymous reviewer for this point.
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Luscombe, A., Dick, K. & Walby, K. Algorithmic thinking in the public interest: navigating technical, legal, and ethical hurdles to web scraping in the social sciences. Qual Quant (2021). https://doi.org/10.1007/s11135-021-01164-0
Keywords
- Web scraping
- Digital methods
- Algorithmic thinking
- Access to information
- Social science research