Abstract
In the era of ever-growing online social networking communities, reports of online crimes of various forms and targets are rising sharply, highlighting the need to develop and enforce solutions aimed at early detection and prevention. In today's landscape, child sexual abuse (CSA), and especially online grooming, is even more prominent given the intense involvement of young people in these communities. Grooming detection techniques built on machine learning have been at the forefront of the prevention and protection of minors. However, current approaches face significant challenges that affect their efficacy and usability. In this chapter, we investigate the challenges in creating effective grooming detection systems and propose future directions to be explored as part of the CESAGRAM project's response to child sexual abuse.
Introduction
The growing number of reported crimes suspected of involving online sexual abuse [1], along with the proliferation of online social networking communities, whose users are predominantly young people aged 15 to 24 [2], has made child sexual abuse (CSA), and especially online grooming, even more prominent. In particular, the increase in risky online behaviours such as sexting, grooming, and child prostitution has raised concerns among parents, educators, and mental health professionals [3]. Among these, grooming poses a particular risk to the safety and well-being of children. Grooming is defined as the process of preparing a child, significant individuals, and the environment for the purpose of sexually abusing the child [4], with specific goals such as gaining access to the child, ensuring compliance, and maintaining secrecy to prevent disclosure. Grooming not only reinforces the abuser's patterns but can also be used to justify or deny their actions.
In the context of child grooming, Information and Communication Technologies (ICTs) are commonly utilised to recruit and exploit young individuals for sexual purposes within trust-based relationships between minors and adults [5]. The grooming process often begins with the perpetrator engaging the child in inappropriate online sexual activities or sending explicit content. To create a safer online space for children, machine learning methods have been developed to enable the automatic detection of grooming activities on online platforms. In this chapter, we analyse the current methods created to deal with online grooming and explore their challenges. Based on our findings, we propose future directions, as part of the CESAGRAM project's response to online child sexual exploitation and abuse, for improving online grooming detection and extending it to multiple languages; thus far, the focus has been on English content only. CESAGRAMFootnote 1 is a two-year European-funded project (GA No. 101084974) that aims to tackle online child sexual exploitation and abuse by enhancing the understanding of the grooming process, particularly the way it is facilitated by technology, as well as its link to CSA and missing-children cases, a sector currently under-researched. The project's main pillars are research, training and awareness raising, the development of a set of artificial intelligence (AI) tools to facilitate the detection and prevention of grooming content online, and advocacy.
Background
Child grooming commonly starts in an online setting, instigated by adults who forward inappropriate content to children or engage them in sexual activities. These actions aim to desensitise the child and increase the likelihood of future sexual abuse [6]. Although grooming methods may vary, certain constants can be observed throughout the process. The perpetrator intentionally desensitises the child both physically and psychologically, making them more susceptible to engaging in sexual activities. Techniques such as active involvement, power dynamics, and control are employed to manipulate the child and reduce their inhibitions [7]. A comprehensive understanding of the nature and characteristics of grooming is crucial to addressing the risks associated with online activities and ensuring the protection of young children. By recognising the complex aspects of grooming and its manifestation in the digital realm, strategies to effectively prevent and respond to this form of abuse can be developed.
To prevent and combat child grooming, both offline and online, and to protect children's rights, several legislative efforts have been proposed and adopted at the national, European, and international levels. The United Nations Convention on the Rights of the Child (UNCRC),Footnote 2 the Universal Declaration of Human Rights,Footnote 3 along with the Charter of Fundamental Rights of the European Union and the European Convention on Human Rights (ECHR),Footnote 4 have been crucial treaties that ensure, among other things, the proper respect and protection of children's rights and well-being. In parallel, the Council of Europe Convention on the Protection of Children against Sexual Exploitation and Sexual Abuse (Lanzarote Convention)Footnote 5 has adopted specific measures in this area, complemented by Directive 2011/93/EU of the European Parliament,Footnote 6 while a new RegulationFootnote 7 has been proposed to empower the prevention, detection, reporting, and removal of child sexual abuse material and grooming online, and to further support victims. Furthermore, the European Union has proposed a five-year strategy (2020–2025)Footnote 8 focusing on better coordination among responsible stakeholders through multi-stakeholder cooperation, with the goal of having a strong legal framework in place and establishing a strengthened law enforcement response that helps Member States address the new challenges stemming from emerging technological advancements.
Landscape of Available Grooming Data
Machine learning has been extensively leveraged to develop solutions for the effective detection of potential online child grooming activities. Building robust machine and deep learning models requires labelled (annotated) datasets of sufficient quality and size. However, publicly available datasets containing grooming examples are rather scarce, possibly due to the sensitivity of the subject under study. Nevertheless, there have been some initial attempts to create datasets that machine/deep learning models can exploit to tackle the problem at hand. One of the largest sources of predatory conversations is Perverted-Justice (PJ),Footnote 9 which contains chat logs of individuals convicted of grooming, conversing with decoy operators rather than actual victims. However, the majority of logs are over a decade old, with 2016 being the most recent and most predating 2010. Such outdated data can hamper the effective detection of potential grooming activities, as models developed on it may struggle, or even fail, to capture recent changes in dialogue and predatory tactics.
ChatCoder2 [8], another source of data, consists solely of predatory conversations extracted from Perverted-Justice. It contains 497 conversations (chats) and was mainly built for studying the semantic segmentation of grooming chats, characterising each segment as predatory or not. Moreover, the PAN12 dataset [9] combines grooming conversations with non-grooming ones obtained from the logs of IRC (Internet Relay Chat) channels and of the chatting site Omegle.Footnote 10 The non-grooming conversations include, among other non-predatory content, cybersex between consenting adults and involve real people, whereas, as with the PJ data, the grooming chats come from decoy operations. Conversations are split into segments; the dataset contains 222 k segments, of which only 2.58% are grooming, a distribution intended to mimic the real distribution of grooming chats on the Web. Finally, since the dataset was introduced in 2012, its conversations date up to that year.
In addition, a dataset combining the aforementioned two is known as PANC [10] and consists of non-predatory segments from PAN12 and full-length predator chats from ChatCoder2 divided into segments. Finally, PJZC [11] contains data originating from PAN12, organised in JSON format and re-organised by the authors in a way to fit their task of early grooming detection. Specifically, the authors combined predatory segments belonging to the same conversation and labelled the entire conversation as predatory, rather than the individual segments, with the aim of detecting early signs of grooming attempts in entire conversations.
Based on the above, several limitations can be observed with the grooming-related datasets available. First, the data come from earlier years, hindering the process of effectively identifying more recent manifestations of grooming activities. Additionally, they all contain decoys and not real victims, which can also hinder the effectiveness of machine/deep learning models when trying to detect grooming in conversations consisting of real victims and perpetrators; it is particularly difficult to imitate the real behaviour of other persons as each person has a unique way of reacting to a situation and consequently expressing their feelings. Finally, another less apparent limitation is the lack of multilingual grooming datasets, as all existing ones contain only conversations written in English, thus restricting their use in grooming detection for other languages.
Machine and Deep Learning Methods for Grooming Detection
As mentioned, machine and deep learning have been leveraged thus far to enable the detection of online grooming activities. In particular, grooming detection is tackled as a text-based binary classification problem over a set of chat messages, typically split into segments, with the goal of identifying whether a segment contains grooming or not. Each text consists of a sequence of words that must be converted into machine-readable representations before being fed to a machine/deep learning model. One of the most commonly used methods to this end is Term Frequency-Inverse Document Frequency (TF-IDF) [12, 13]. More recently, to obtain better representations of textual data, pre-trained word embeddings from Word2Vec or GloVe [14] have been exploited, which also capture the relationships between words in a sequence.
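As an illustration, the TF-IDF weighting described above can be sketched in a few lines of Python. The chat segments below are invented examples, and the scheme shown (length-normalised term frequency multiplied by the logarithmic inverse document frequency) is only one common variant of TF-IDF, not the exact formulation used in [12, 13]:

```python
import math
from collections import Counter

def tfidf_vectors(segments):
    """Compute sparse TF-IDF vectors for a list of chat segments.

    TF is the raw term count normalised by segment length; IDF is
    log(N / df), where df is the number of segments containing the term.
    """
    tokenised = [seg.lower().split() for seg in segments]
    n = len(tokenised)
    # Document frequency: in how many segments does each term appear?
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors

# Toy chat segments (invented for illustration only).
segments = [
    "hi how are you",
    "are you home alone",
    "hi do you want to chat",
]
vecs = tfidf_vectors(segments)
```

Note how a term occurring in every segment (here, "you") receives zero weight, while a rarer, potentially discriminative term such as "alone" receives a positive one; this is the property that makes TF-IDF a useful baseline representation for classification.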
Focusing on the models themselves, most works apply traditional machine learning-based solutions, including Support Vector Machines (SVM) [13, 15], k-Nearest Neighbors (kNN) [15], and Logistic Regression (LR) [16]. Additionally, extracting features that capture grooming characteristics from the examined chats was shown to aid the classification process. For instance, providing the model with a binary vector denoting the presence of seventeen distinct grooming characteristics extracted from the text (e.g. asking questions to gauge the risk of the conversation, or asking whether the child is alone or under adult or friend supervision), instead of the TF-IDF representation of the actual text, was found to increase detection performance [15].
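A minimal sketch of such a characteristic-based representation is shown below. The indicator patterns are hypothetical stand-ins written for illustration, not the actual seventeen characteristics or detection rules used in [15]:

```python
import re

# Hypothetical indicator patterns (assumptions for illustration only).
CHARACTERISTICS = {
    "asks_if_alone": re.compile(r"\b(are you alone|anyone home|parents (home|around))\b"),
    "requests_secrecy": re.compile(r"\b(our secret|don'?t tell)\b"),
    "asks_age": re.compile(r"\bhow old are you\b"),
}

def characteristic_vector(chat_text):
    """Map a chat to a binary vector: 1 if a characteristic is present."""
    text = chat_text.lower()
    return [1 if pattern.search(text) else 0
            for pattern in CHARACTERISTICS.values()]

# Toy example (invented): two of the three indicators fire.
vec = characteristic_vector("hey, how old are you? are you alone right now?")
```

The resulting low-dimensional binary vector can then be fed to any of the classifiers named above (SVM, kNN, LR) in place of a high-dimensional TF-IDF vector.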
Deep learning-based models have also been employed for grooming detection, including the multi-layer perceptron (MLP) [12] and convolutional neural networks (CNN) [14]. The former followed an author-based approach, where all messages in a conversation originating from the same author were grouped together to deduce whether there is any grooming activity. In the latter, the authors first explored recurrent neural networks (RNN) and concluded that their performance is inadequate on the large conversation segments typical of grooming detection. They therefore proposed a CNN-based model whose performance does not degrade with segment length. Through experimentation, they additionally found that feeding the CNN the input data directly, so that the model learns the embeddings itself, can further increase performance: the model can then learn task-specific word representations, especially for words commonly used in grooming chats that are not present in the pre-trained embedding models (such as Word2Vec or GloVe) often used in classification tasks.
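The reason a convolution-plus-max-pooling architecture copes with long segments can be seen in a toy sketch: the pooled score depends only on the strongest local match anywhere in the sequence, not on the sequence length. The filter and embedding values below are invented for illustration and bear no relation to the learned parameters of the model in [14]:

```python
def conv_max_pool(embeddings, filt):
    """Slide one filter of width k over the token embeddings and max-pool.

    Each window score is the dot product between the filter and the
    flattened window of token embeddings.
    """
    k = len(filt)
    scores = []
    for i in range(len(embeddings) - k + 1):
        window = embeddings[i:i + k]
        score = sum(w * e
                    for f_row, e_row in zip(filt, window)
                    for w, e in zip(f_row, e_row))
        scores.append(score)
    return max(scores)

# A width-2 filter that "fires" on one particular bigram pattern (toy values).
filt = [[1.0, 0.0], [0.0, 1.0]]
trigger = [[1.0, 0.0], [0.0, 1.0]]   # the bigram the filter responds to
pad = [0.0, 0.0]                     # a neutral "background" token

short_seg = trigger
long_seg = [pad] * 5 + trigger + [pad] * 20  # same bigram buried in a long segment
```

Because max-pooling keeps only the highest window score, the short and the long segment yield the same pooled activation here, which is the length-invariance property the CNN approach relies on; an RNN, by contrast, must propagate its state across every intermediate token.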
Open Challenges
Despite the ever-increasing efforts to develop effective methods against online grooming, the field still faces significant challenges. First, as pointed out earlier, publicly available datasets are scarce and mostly come from a single source (namely, Perverted-Justice). Even for the existing ones, suitability is somewhat questionable, as they include data from more than a decade ago and may therefore fail to capture today's modes of expression (e.g. the higher prominence of transliterations in recent years and different slang terms) and of manifesting grooming overall. Additionally, as mentioned, the currently available datasets contain conversations conducted by decoy operators rather than real victims, which raises questions as to whether the models developed on them will be effective in real-life scenarios. Finally, the absence of multilingual grooming datasets makes it difficult to apply grooming detection to non-English conversations, giving rise to another open challenge. As a countermeasure, language models (LMs) can be used to translate the existing datasets into the desired language; however, this approach could introduce bias into the dataset or fail to capture the unique idioms of each language, thereby hindering detection effectiveness overall.
Focusing on the models themselves, and in particular on the important step of text representation, existing approaches mostly rely on simple solutions such as TF-IDF or non-contextual embeddings like Word2Vec and GloVe. With such approaches, the structure of a text (sentence) is not taken into consideration; a representation is extracted for each word regardless of the context in which it is used. While effective in certain scenarios, these representations may fall short in grooming detection, where the context in which each word appears can be vital in determining whether a conversation contains grooming attempts. A possible solution is to train and use contextual embeddings, such as those produced by BERT [17], which consider the context of a word in a sentence, in contrast to the non-contextualised ones, and which can also represent slang words commonly used in chats that may be absent from pre-trained embedding models such as GloVe [18]. However, training models to provide contextual embeddings for grooming detection is challenging, as grooming-related data are limited, and such models require a large number of instances as well as substantial resources to produce high-quality embeddings.
Future Directions
In the online world, individuals can maintain multiple identities across different platforms, or even within the same one, either to deceive a wider range of individuals or to better conceal and maintain their online identity; even if an account is detected for infringing behaviour, the person's activity can seamlessly continue through another [19]. As mentioned, perpetrators often resort to a similar course of action to deceive their victims [20], e.g. through victim isolation and trust development, making it difficult to identify in a timely manner accounts managed by the same person. However, each individual's personality comprises a unique set of behaviours, experiences, and feelings, which is also reflected in their way of writing. This writing blueprint can be leveraged by automatic mechanisms known as identity resolution, which uncover potential links among the unprecedentedly high number of online user accounts [21]. So far, identity resolution has been employed by law enforcement to uncover previously unknown connections between actors that share common characteristics (e.g. a similar address) [22], paving the way for its use in the fight against grooming activities as well. Stylometric attributes (e.g. vocabulary diversity or writing idiosyncrasies), as well as contextualised distributional semantic features (e.g. captured by BERT), can be leveraged to identify multiple accounts likely operated by the same perpetrator [23]. Ultimately, well-hidden relationships can be revealed, allowing further potential victims to be identified at early stages.
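A minimal sketch of the stylometric side of such an identity-resolution pipeline might look as follows. The three features and their computation are illustrative assumptions, not the feature set of [23]:

```python
import math
import string

def stylometric_profile(messages):
    """Hypothetical stylometric profile of one account's messages:
    type-token ratio (vocabulary diversity), mean word length, and
    punctuation rate."""
    text = " ".join(messages)
    tokens = [w.strip(string.punctuation).lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    punct = sum(ch in string.punctuation for ch in text)
    return [
        len(set(tokens)) / len(tokens),             # vocabulary diversity
        sum(len(t) for t in tokens) / len(tokens),  # mean word length
        punct / len(text),                          # punctuation rate
    ]

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy comparison of two accounts (invented messages).
profile_a = stylometric_profile(["hello there!! how r u", "wanna chat?? im bored"])
profile_b = stylometric_profile(["good evening, how are you today"])
similarity = cosine(profile_a, profile_b)
```

In practice, a high similarity score between profiles would only flag a candidate account pair for further investigation (e.g. in combination with BERT-derived semantic features), never serve as evidence of shared ownership on its own.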
Similarly, adapting the way language is perceived by LMs through fine-tuning [24], so as to better reflect current trends in online written language, such as shifts in word meanings over time, slang terms, and transliteration, will be an invaluable asset in developing more effective grooming detection systems. While such approaches require only unlabelled, generic data, they do not circumvent the lack of labelled training data for grooming detection. As such, in-depth experimental investigation is required into annotating new gold-standard data, creating synthetic data that simulate real behaviours to the extent possible [25], or considering transfer learning approaches such as few-shot and meta-learning [26, 27].
Notes
- 1.
Towards a Comprehensive European Strategy Against tech-facilitated Grooming And Missing; https://cesagramproject.eu/
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
References
Negreiro, M. (2022, December). [Online]. Available: https://www.europarl.europa.eu/RegData/etudes/BRIE/2022/738224/EPRS_BRI(2022)738224EN.pdf
Petrosyan, A. (2023). Worldwide digital population 2023. Retrieved from https://www.statista.com/statistics/617136/digital-population-worldwide/
Estefenon, S. G. B., & Eisenstein, E. (2015). La sexualidad en la Era Digital. Adolescencia e Saude, 12, 83–87.
Craven, S., Brown, S., & Gilchrist, E. (2006). Sexual grooming of children: Review of literature and theoretical considerations. Journal of Sexual Aggression, 12, 287–299.
Wachs, S., Wolf, K., & Pan, C.-C. (2012). Cybergrooming: Risk factors, coping strategies and associations with cyberbullying. Psicothema, 24, 628–633.
Quayle, E., Allegro, S., Hutton, L., Sheath, M., & Lööf, L. (2014). Rapid skill acquisition and online sexual grooming of children. Computers in Human Behavior, 39, 368–375.
Berson, I. R. (2003). Grooming Cybervictims. Journal of School Violence, 2, 5–18.
Chat Coder 2 dataset. https://www.chatcoder.com/data.html. Accessed 27 June.
Inches, G., & Crestani, F. (2012). PAN12 deception detection: Sexual predator identification. Zenodo.
Vogt, M., Leser, U., & Akbik, A. (2021). Early detection of sexual predators in chats. In Proceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing (Long Papers) (Vol. 1) Online.
Milon-Flores, D. F., & Cordeiro, R. L. F. (2022). How to take advantage of behavioral features for the early detection of grooming in online conversations. Knowledge-Based Systems, 240, 108017.
Bours, P., & Kulsrud, H. (2019). Detection of cyber grooming in online conversation. In 2019 IEEE international Workshop on Information Forensics and Security (WIFS).
Sulaiman, N. R., & Siraj, M. M. (2019). Classification of online grooming on chat logs using two term weighting schemes. International Journal of Innovative Computing, 9.
Ebrahimi, M., Suen, C. Y., & Ormandjieva, O. (2016). Detecting predatory conversations in social media by deep Convolutional Neural Networks. Digital Investigation, 18, 33–49.
Gunawan, F. E., Ashianti, L., Candra, S., & Soewito, B. (2016). Detecting online child grooming conversation. In 2016 11th international conference on Knowledge, Information and Creativity Support Systems (KICSS).
Pranoto, H., Gunawan, F. E., & Soewito, B. (2015). Logistic models for classifying online grooming conversation. Procedia Computer Science, 59, 357–365.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Borj, P. R., Raja, K., & Bours, P. (2023). Online grooming detection: A comprehensive survey of child exploitation in chat logs. Knowledge-Based Systems, 259.
Reuters Staff. (2019). Twitter suspends 100k accounts for creating new ones after suspension.
RAINN. (2018). Grooming: Know the warning signs.
Tsikerdekis, M., & Zeadally, S. (2014). Multiple account identity deception detection in social media using nonverbal behavior. IEEE Transactions on Information Forensics and Security, 9, 1311–1321.
Homeland Security. (2018). The role of identity resolution in criminal investigations.
Chatzakou, D., Soler-Company, J., Tsikrika, T., Wanner, L., Vrochidis, S., & Kompatsiaris, I. (2020). User identity linkage in social media using linguistic and social interaction features. In Proceedings of the 12th ACM conference on Web Science.
Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020). Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th annual meeting of the Association for Computational Linguistics. Online.
Stylianou, N., Chatzakou, D., Tsikrika, T., Vrochidis, S., & Kompatsiaris, I. (2023). Domain-aligned data augmentation for low-resource and imbalanced text classification. In European conference on information retrieval.
Parnami, A., & Lee, M. (2022). Learning from few examples: A summary of approaches to few-shot learning. arXiv preprint arXiv, 2203.04291.
Vanschoren, J. (2019). Meta-learning. Automated machine learning: methods, systems, challenges, 35–61.
Acknowledgements
This work was supported by the CESAGRAM project funded by the European Union (Internal Security Fund) under Grant Agreement No. 101084974. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union. The European Union cannot be held responsible for them.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2025 The Author(s)
About this chapter
Cite this chapter
Mylonas, N. et al. (2025). Online Child Grooming Detection: Challenges and Future Directions. In: Gkotsis, I., Kavallieros, D., Stoianov, N., Vrochidis, S., Diagourtas, D., Akhgar, B. (eds) Paradigms on Technology Development for Security Practitioners. Security Informatics and Law Enforcement. Springer, Cham. https://doi.org/10.1007/978-3-031-62083-6_19
DOI: https://doi.org/10.1007/978-3-031-62083-6_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-62082-9
Online ISBN: 978-3-031-62083-6
eBook Packages: Physics and Astronomy (R0)