Algorithmic thinking in the public interest: navigating technical, legal, and ethical hurdles to web scraping in the social sciences

Abstract

Web scraping, defined as the automated extraction of information online, is an increasingly important means of producing data in the social sciences. We contribute to the emerging social science literature on computational methods by elaborating on web scraping as a means of automated access to information. We begin by situating the practice of web scraping in context, providing an overview of how it works and how it compares to other methods in the social sciences. Next, we assess the benefits and challenges of scraping as a technique of information production. In terms of benefits, we highlight how scraping can help researchers answer new questions, supersede limits in official data, overcome access hurdles, and reinvigorate the values of sharing, openness, and trust in the social sciences. In terms of challenges, we discuss three: technical, legal, and ethical. By adopting “algorithmic thinking in the public interest” as a way of navigating these hurdles, researchers can improve the state of access to information on the Internet while also contributing to scholarly discussions about the legality and ethics of web scraping. Example software accompanying this article is available within the supplementary materials.

Notes

  1. Other common monikers for data scraping include web scraping, screen scraping, web data extraction, web harvesting, and data harvesting. There are technical differences between the concepts of data “scraping” and website “crawling”. A crawler is a bot that navigates to a website for the purpose of indexing (i.e., recording keywords and metadata) and then navigates to other websites via the links on that page. A scraper is a bot designed with the explicit intent of navigating to and extracting specific information from one or multiple target websites. For the sake of simplicity, we conflate the two concepts here. Where differences exist, the two are contrasted in-text.

  2. Within the Supplementary Materials, we exemplify the combined use of several scraping libraries to achieve increasingly complex automated data extraction.

  3. We thank the anonymous reviewer for this point.
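
The distinction drawn in note 1 between a crawler and a scraper can be illustrated with a minimal sketch. It is not the authors' supplementary software: the in-memory `PAGES` dictionary, the `price` class name, and the helper names `PageParser`, `parse`, `crawl`, and `scrape` are all hypothetical and chosen only for illustration. The sketch uses Python's standard-library `html.parser` and runs without any network access.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website" so the sketch runs without network access.
PAGES = {
    "/index": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "/a": '<title>Listing A</title><span class="price">1200</span><a href="/b">B</a>',
    "/b": '<title>Listing B</title><span class="price">950</span><a href="/index">Home</a>',
}


class PageParser(HTMLParser):
    """Collects the title, outgoing links, and any price values on one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.prices = []
        self._context = None  # which element's text we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._context = "title"
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "span" and attrs.get("class") == "price":
            self._context = "price"

    def handle_data(self, data):
        if self._context == "title":
            self.title += data
        elif self._context == "price":
            self.prices.append(int(data))

    def handle_endtag(self, tag):
        if tag in ("title", "span"):
            self._context = None


def parse(url):
    """Parse one page from the hypothetical in-memory site."""
    parser = PageParser()
    parser.feed(PAGES[url])
    parser.close()
    return parser


def crawl(start):
    """Crawler: follow links breadth-first and index each page's title."""
    index, queue, seen = {}, deque([start]), {start}
    while queue:
        url = queue.popleft()
        page = parse(url)
        index[url] = page.title
        for link in page.links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def scrape(urls):
    """Scraper: visit known target pages and extract one specific field."""
    return {url: parse(url).prices for url in urls}


if __name__ == "__main__":
    print(crawl("/index"))
    print(scrape(["/a", "/b"]))
```

In this sketch the crawler discovers pages by following links and records only general metadata, whereas the scraper is handed its targets and pulls out a single predefined field. A real project would add polite request pacing and a check of the target site's terms, which are omitted here.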

Author information

Corresponding author

Correspondence to Alex Luscombe.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 136 KB)

About this article

Cite this article

Luscombe, A., Dick, K. & Walby, K. Algorithmic thinking in the public interest: navigating technical, legal, and ethical hurdles to web scraping in the social sciences. Qual Quant (2021). https://doi.org/10.1007/s11135-021-01164-0

Keywords

  • Web scraping
  • Digital methods
  • Law
  • Ethics
  • Algorithmic thinking
  • Access to information
  • Social science research