Never before have criminologists had such rich data about the communications of a wide variety of individuals involved at various stages of crime. We now have records of discussions held between cybercrime offenders going back 20 years. Indeed, with over 70 million posts by almost two million users, we are encountering a different type of problem: we have too much data. Although these datasets potentially allow us to answer questions we never before thought it possible to answer, they also pose unique challenges, such as categorizing large datasets and accounting for temporal shifts in users, topics, ideas, and ways of communicating. One answer to this problem may lie in automation: using machine learning to classify and label posts and interactions at scale. In this chapter, we outline some of the opportunities and challenges associated with using such large datasets, some of the ways we are currently addressing these challenges, and potential ways forward.
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Hughes, J., Chua, Y.T., Hutchings, A. (2021). Too Much Data? Opportunities and Challenges of Large Datasets and Cybercrime. In: Lavorgna, A., Holt, T.J. (eds) Researching Cybercrimes. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-030-74837-1_10
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-030-74836-4
Online ISBN: 978-3-030-74837-1
eBook Packages: Law and Criminology (R0)