Too Much Data? Opportunities and Challenges of Large Datasets and Cybercrime

Abstract

Never before have criminologists had such rich data about the communications of a wide variety of individuals involved at various stages of crime. We now have records of discussions held between cybercrime offenders going back 20 years. Indeed, given that we now have over 70 million posts by almost two million users, we are encountering a different type of problem: we have too much data. Although these datasets may allow us to answer questions we never before thought it possible to answer, we also face unique challenges, such as categorizing large datasets and tracking temporal shifts in users, topics, ideas, and ways of communicating. One answer to this problem may lie in automation: using machine learning to classify and label posts and interactions at scale. In this chapter, we outline some of the opportunities and challenges associated with using such large datasets, some of the ways we are currently addressing these challenges, and potential ways forward.

Fig. 10.1 (adapted from Hughes et al., 2019)

Fig. 10.2 (adapted from Pastrana et al., 2019)

References

  • Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired Magazine, 16(7), 16–17.

  • Bada, M., Chua, Y. T., Collier, B., & Pete, I. (2020). Exploring masculinities and perceptions of gender in online cybercrime subcultures. In Proceedings of the 2nd Annual Conference on the Human Factor in Cybercrime.

  • Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: An open source software for exploring and manipulating networks. In Proceedings of the Third International ICWSM Conference (pp. 361–362).

  • Benjamin, V., Li, W., Holt, T., & Chen, H. (2015). Exploring threats and vulnerabilities in hacker web: Forums, IRC and carding shops. In 2015 IEEE International Conference on Intelligence and Security Informatics (ISI) (pp. 85–90).

  • Bevensee, E., Aliapoulios, M., Dougherty, Q., Baumgartner, J., McCoy, D., & Blackburn, J. (2020). SMAT: The social media analysis toolkit. In Proceedings of the Fourteenth International AAAI Conference on Web and Social Media.

  • Burrows, R., & Savage, M. (2014). After the crisis? Big Data and the methodological challenges of empirical sociology. Big Data and Society, 1(1), 1–6.

  • Caines, A., Pastrana, S., Hutchings, A., & Buttery, P. (2018). Automatically identifying the function and intent of posts in underground forums. Crime Science, 7(19), 1–14.

  • Cambridge Cybercrime Centre. (2019). Process for working with our data. Available at: https://www.cambridgecybercrime.uk/process.html.

  • Chan, J., & Bennett Moses, L. (2016). Is Big Data challenging criminology? Theoretical Criminology, 20(1), 21–39.

  • Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171–209.

  • Christie, N. (1997). Four blocks against insight: Notes on the oversocialization of criminologists. Theoretical Criminology, 1(1), 13–23.

  • Collier, B., Thomas, D. R., Clayton, R., & Hutchings, A. (2019). Booting the Booters: Evaluating the effects of police interventions in the market for denial-of-service attacks. In Proceedings of the ACM Internet Measurement Conference. Amsterdam.

  • Davis, C. A., Ciampaglia, G. L., Aiello, L. M., Chung, K., Conover, M. D., Ferrara, E., Flammini, A., Fox, G. C., Gao, X., Gonçalves, B., Grabowicz, P. A., Hong, K., Hui, P., McCaulay, S., McKelvey, K., Meiss, M. R., Patil, S., Peli, C., Pentchev, V., … Menczer, F. (2016). OSoMe: The IUNI observatory on social media. PeerJ Computer Science, 2, e87.

  • Edwards, A., Housley, W., Williams, M., Sloan, L., & Williams, M. (2013). Digital social research, social media and the sociological imagination: Surrogacy, augmentation and re-orientation. International Journal of Social Research Methodology, 16(3), 245–260.

  • Gerritsen, C. (2020). Big data and criminology from an AI perspective. In B. Leclerc & J. Calle (Eds.), Big Data. Routledge.

  • González‐Bailón, S. (2013). Social science in the era of big data. Policy and Internet, 5(2), 147–160.

  • Hayward, K. J., & Maas, M. M. (2020). Artificial intelligence and crime: A primer for criminologists. Crime, Media, Culture, 1741659020917434.

  • Holt, T. J., & Dupont, B. (2019). Exploring the factors associated with rejection from a closed cybercrime community. International Journal of Offender Therapy and Comparative Criminology, 63(8), 1127–1147.

  • Hughes, J., Aycock, S., Caines, A., Buttery, P., & Hutchings, A. (2020). Detecting trending terms in cybersecurity forum discussions. Workshop on Noisy User-Generated Text (W-NUT).

  • Hughes, J., Collier, B., & Hutchings, A. (2019). From playing games to committing crimes: A multi-technique approach to predicting key actors on an online gaming forum. In Proceedings of the APWG Symposium on Electronic Crime Research (eCrime). Pittsburgh.

  • Hutchings, A., & Pastrana, S. (2019). Understanding eWhoring. In Proceedings of the 4th IEEE European Symposium on Security and Privacy. Stockholm.

  • Kitchin, R. (2014). Big Data, new epistemologies and paradigm shifts. Big Data and Society, 1(1), 1–12.

  • Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D., & Van Alstyne, M. (2009). Computational social science. Science (New York, NY), 323(5915), 721.

  • Lazer, D., & Radford, J. (2017). Data ex machina: Introduction to big data. Annual Review of Sociology, 43, 19–39.

  • Lee, J. R., & Holt, T. J. (2020). The challenges and concerns of using big data to understand cybercrime. In B. Leclerc & J. Calle (Eds.), Big Data. Routledge.

  • Li, W., Chen, H., & Nunamaker, J. F., Jr. (2016). Identifying and profiling key sellers in cyber carding community: AZSecure text mining system. Journal of Management Information Systems, 33(4), 1059–1086.

  • Lynch, J. (2018). Not even our own facts: Criminology in the era of big data. Criminology, 56(3), 437–454.

  • Manyika, J. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. Available at: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.

  • Metzler, K., Kim, D. A., Allum, N., & Denman, A. (2016). Who is doing computational social science? In Trends in big data research.

  • Moore, T., Kenneally, E., Collett, M., & Thapa, P. (2019). Valuing cybersecurity research datasets. In 18th Workshop on the Economics of Information Security (WEIS).

  • Motoyama, M., McCoy, D., Levchenko, K., Savage, S., & Voelker, G. M. (2011). An analysis of underground forums. In Proceedings of the 2011 ACM SIGCOMM Internet Measurement Conference (pp. 71–80).

  • Nagin, D. S. (2005). Group-based modeling of development. Harvard University Press.

  • Narayanan, A., & Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. In Proceedings of the IEEE Symposium on Security and Privacy (sp 2008) (pp. 111–125).

  • Newman, G. R., & Clarke, R. V. (2003). Superhighway robbery: Preventing E-commerce crime. Willan.

  • Ozkan, T. (2019). Criminology in the age of data explosion: New directions. The Social Science Journal, 56(2), 208–219.

  • Pastrana, S., Hutchings, A., Caines, A., & Buttery, P. (2018a). Characterizing Eve: Analysing cybercrime actors in a large underground forum. In Proceedings of the 21st International Symposium on Research in Attacks, Intrusions and Defenses (RAID). Heraklion.

  • Pastrana, S., Thomas, D. R., Hutchings, A., & Clayton, R. (2018b). CrimeBB: Enabling cybercrime research on underground forums at scale. In Proceedings of the 2018 World Wide Web Conference (pp. 1845–1854).

  • Pastrana, S., Hutchings, A., Thomas, D. R., & Tapiador, J. (2019). Measuring eWhoring. In Proceedings of the ACM Internet Measurement Conference. Amsterdam.

  • Pete, I., & Chua, Y. T. (2019). An assessment of the usability of cybercrime datasets. In 12th USENIX Workshop on Cyber Security Experimentation and Test (CSET 19).

  • Pete, I., Hughes, J., Bada, M., & Chua, Y. T. (2020). A social network analysis and comparison of six dark web forums. In IEEE European Symposium on Security and Privacy (EuroS&PW) Workshop on Attackers and Cyber Crime Operations (WACCO).

  • Porcedda, M. G., & Wall, D. S. (2019). Cascade and chain effects in big data cybercrime: Lessons from the TalkTalk hack. In IEEE European Symposium on Security and Privacy (EuroS&PW) Workshop on Attackers and Cyber Crime Operations (WACCO) (pp. 443–452).

  • Smith, G. J., Bennett Moses, L., & Chan, J. (2017). The challenges of doing criminology in the big data era: Towards a digital and data-driven approach. The British Journal of Criminology, 57(2), 259–274.

  • Snaphaan, T., & Hardyns, W. (2019). Environmental criminology in the big data era. European Journal of Criminology, 1–22.

  • Snijders, C., Matzat, U., & Reips, U. D. (2012). “Big Data”: Big gaps of knowledge in the field of internet science. International Journal of Internet Science, 7(1), 1–5.

  • Sweeney, L. (1997). Weaving technology and policy together to maintain confidentiality. The Journal of Law, Medicine and Ethics, 25(2–3), 98–110.

  • Thomas, D. R., Clayton, R., & Beresford, A. R. (2017). 1000 days of UDP amplification DDoS attacks. In Proceedings of the 2017 APWG Symposium on Electronic Crime Research (eCrime) (pp. 79–84). IEEE.

  • Tuckman, B. W. (1965). Developmental sequence in small groups. Psychological Bulletin, 63(6), 384.

  • Turk, K., Pastrana, S., & Collier, B. (2020). A tight scrape: Methodological approaches to cybercrime research data collection in adversarial environments. In Proceedings of the IEEE European Symposium on Security and Privacy Workshop on Attackers and Cyber-Crime Operations (WACCO).

  • Vetterl, A., & Clayton, R. (2019). Honware: A virtual honeypot framework for capturing CPE and IoT zero days. In Proceedings of the 2019 APWG Symposium on Electronic Crime Research (eCrime) (pp. 1–13). IEEE.

  • Vu, A. V., Hughes, J., Pete, I., Collier, B., Chua, Y. T., Shumailov, I., & Hutchings, A. (2020). Turning up the dial: The evolution of a cybercrime market through set-up, stable, and COVID-19 eras. In Proceedings of the ACM Internet Measurement Conference. Pittsburgh.

  • Wang, F. Y., Carley, K. M., Zeng, D., & Mao, W. (2007). Social computing: From social informatics to social intelligence. IEEE Intelligent Systems, 22(2), 79–83.

  • Westlake, B. G., & Bouchard, M. (2016). Liking and hyperlinking: Community detection in online child sexual exploitation networks. Social Science Research, 59, 23–36.

  • Yar, M. (2005). The novelty of “Cybercrime”: An assessment in light of routine activity theory. European Journal of Criminology, 2(4), 407–427.

Author information

Corresponding author

Correspondence to Jack Hughes.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Hughes, J., Chua, Y.T., Hutchings, A. (2021). Too Much Data? Opportunities and Challenges of Large Datasets and Cybercrime. In: Lavorgna, A., Holt, T.J. (eds) Researching Cybercrimes. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-030-74837-1_10

  • DOI: https://doi.org/10.1007/978-3-030-74837-1_10

  • Publisher Name: Palgrave Macmillan, Cham

  • Print ISBN: 978-3-030-74836-4

  • Online ISBN: 978-3-030-74837-1

  • eBook Packages: Law and Criminology (R0)