Current Limitations in Cyberbullying Detection: on Evaluation Criteria, Reproducibility, and Data Scarcity

The detection of online cyberbullying has seen an increase in societal importance, popularity in research, and available open data. Nevertheless, while computational power and affordability of resources continue to increase, the access restrictions on high-quality data limit the applicability of state-of-the-art techniques. Consequently, much of the recent research uses small, heterogeneous datasets, without a thorough evaluation of applicability. In this paper, we further illustrate these issues, as we (i) evaluate many publicly available resources for this task and demonstrate difficulties with data collection. These predominantly yield small datasets that fail to capture the required complex social dynamics and impede direct comparison of progress. We (ii) conduct an extensive set of experiments that indicate a general lack of cross-domain generalization of classifiers trained on these sources, and openly provide this framework to replicate and extend our evaluation criteria. Finally, we (iii) present an effective crowdsourcing method: simulating real-life bullying scenarios in a lab setting generates plausible data that can be effectively used to enrich real data. This largely circumvents the restrictions on data that can be collected, and increases classifier performance. We believe these contributions can aid in improving the empirical practices of future research in the field.


Introduction
Learning to accurately classify rare phenomena within large feeds of data poses challenges for numerous applications of machine learning. The volume of data required for representative instances to be included is often resource-consuming, and limited access to such instances can severely impact the reliability of predictions. These limitations are particularly prevalent in applications dealing with sensitive social phenomena such as those found in the field of forensics: e.g., predicting acts of terrorism, detecting fraud, or uncovering sexually transgressive behavior. Their events are complex and require rich representations for effective detection. Conversely, online text, images, and meta-data capturing such interactions have commercial value for the platforms they are hosted on and are often off-limits to protect users' privacy.
An application affected by such limitations, with increasing societal importance and growing interest over the last decade, is that of cyberbullying detection. Not only is it sensitive, but the data is also inherently scarce in terms of public access. Most cyberbullying events are off-limits to the majority of researchers, as they take place in private conversations. Fully capturing the social dynamics and complexity of these events requires much richer data than has been available to the research community up until now. Related to this, various issues with the operationalization of cyberbullying detection research were recently demonstrated by [50], who share many of the concerns we will discuss in this work. While their work focuses on methodological rigor in prior research, we will focus on the core limitations of the domain and the complexity of cyberbullying detection. Through an evaluation of the current advances on the task, we illustrate how the mentioned issues affect current research, particularly cross-domain. Finally, we demonstrate crowdsourcing in an experimental setting to potentially alleviate the task's data scarcity. First, however, we introduce the theoretical framing of cyberbullying and the task of automatically detecting such events.

Cyberbullying
Asynchrony and optional anonymity are characteristic of online communication as we know it today; it heavily relies on the ability to communicate with people who are not physically present, and stimulates interaction with people outside of one's group of close friends through social networks [35]. The rise of these networks brought various advantages to adolescents: studies show positive relationships between online communication and social connectedness [6,62], and that self-disclosure on these networks benefits the quality of existing and newly developed relationships [57]. The popularity of social networks and instant messaging among children has resulted in this age group using devices that are connected to the Internet from increasingly younger ages [42], with 95% of teens ages 12-17 online, of which 80% are on social media [33]. For them, however, the transition from social interaction predominantly taking place on the playground to being mediated through mobile devices [34] has also moved negative communication to a platform where indirect and anonymous interaction has a window into homes.
A range of studies conducted by the Pew Research Center, most notably [33], provides detailed insight into these developments. While 78% of teens report positive outcomes from their social media interactions, 41% have experienced at least some adverse outcomes, ranging from arguments, trouble with school and parents, and physical fights to ending friendships. Of the 19% bullied in the 12 months prior to the study, 8% of all teens reported this was some form of cyberbullying. These numbers are comparable to other research [39,28] (7% for grades 6-12, and 15% for grades 9-12, respectively). Bullying has for a while been regarded as a public health risk by numerous authorities [67], with depression, anxiety, low self-esteem, school absence, lower grades, and risk of self-medication as primary concerns.
The act of cyberbullying, other than being conducted online, shares the characteristics of traditional bullying: a power imbalance between the bully and victim [54], the harm is intentional, repeated over time, and has a negative psychological effect on the victim [18]. With the Internet as a communication platform, however, some additional aspects arise: location, time, and physical presence have become irrelevant factors in the act. Accordingly, several categories unique to this form of bullying are defined [66,5]: flaming (sending rude or vulgar messages), outing (posting private information or manipulated personal material of an individual without consent), harassment (repeatedly sending offensive messages to a single person), exclusion (from an online group), cyberstalking (terrorizing through sending explicitly threatening and intimidating messages), denigration (spreading online gossip), and impersonation. Moreover, in addition to optional anonymity hiding the critical figures behind an act of cyberbullying, it could also obfuscate the number of actors (i.e., there might only be one even though it seems there are more). Cyberbullying acts can prove challenging to remove once published; messages or images might persist through sharing and be viewable by many (as is typical for hate pages), or available to a few (in group or direct conversations). Hence, it can be argued that any form of harassment has become more accessible and intrusive. This online nature has an advantage as well: in theory, platforms record these bullying instances. Therefore, an increasing number of researchers are interested in the automatic detection (and prevention) of cyberbullying.

Detection and Task Complexity
The task of cyberbullying detection can be broadly defined as the use of machine learning techniques to automatically classify text in messages for bullying content, or to infer characteristic features based on higher-order information, such as user features or social network attributes. Bullying is most apparent in younger age groups through direct verbal outings [61], and more subtle in older groups, mainly manifested in more complex social dynamics such as exclusion, sabotage, and gossip [45]. Therefore, the majority of work on the topic focuses on younger age groups, be it deliberately or given that the primary source for data is social media, where these groups are likely to be highly present on some platforms [21]. Apart from the well-established challenges that language-use poses (e.g., ambiguity, sarcasm), two factors in the event add further linguistic complexity, namely that of actor role and associated context. In contrast to tasks where adequate information is provided in the text of a single message alone, completely mapping a cyberbullying event and pinpointing bully and victim implies some understanding of the dynamics between the involved actors and the concurrent textual interpretation.
Roles. Firstly, there is a commonly made distinction between several actors within a cyberbullying event. A naive role allocation includes a bully B, a victim V, and a bystander BY, the latter of whom may or may not approve of the act of bullying. More nuanced models such as that of [67] include the additional roles of reinforcer BF, assistant AB, defender S, reporter R, and accuser A. Different roles can be assigned to one person (for example, being bullied and reporting this); they are visualized in Figure 1. Most importantly, all shown roles can be present in the span of one single thread on social media, as demonstrated in Table 1. While some roles clearly show from frequent interaction with either a positive or negative sentiment (B, V, A), others might not be observable through any form of conversation (R, BY), are too subtle, or are not distinguishable from other roles.

Figure 1: Each vertex represents an actor, labeled by their role in the event. Each edge indicates a stream of communication, labeled by whether this is positive (+) or negative (−) in nature, with its strength reflecting the frequency of interaction. The dotted vertices were added by [67] to account for social-media-specific roles.

Table 1: Fictional example of a cyberbullying conversation. Lines represent sequential turns. Roles are noted as described in the Roles paragraph, the Bully column indicates whether the message can be considered bullying, and types are according to [64].

Line | Role | Message | Bully | Type
1 | V | me and my friends hanging out tonight! :) | | neutral
2 | B | @V lol b*tch, you dont have any friends.. ur fake as sh*t | | curse, insult
3 | AB | @B haha word, shes so sad | | encouragement
4 | VF | @V you know it girl | |
5 | S | @V dont listen to @B, were gonna have fun for sure! | | defense
6 | V | @B shut up @B!! nobody asked your opinion!!!! | | defense
7 | A | @B you are a f*cking bully, go outside or smt | | insult
8 | B | @V @S haha you all so dumb, just kill yourself already! | | insult, curse
9 | A, R | @B shut up or ill report you | |
10 | B | @A u gonna cry? go ahead, see what happens tomorrow! | | threat
Context. Secondly, the content of the messages has to be interpreted differently between these roles. While curse words can be a good indication of harassment, identification of a bully arguably requires more than these alone. Consider Table 1: both B and A use insults (lines 7-8), the message of V (line 6) might be considered as bullying in isolation, and having already determined B, the last sentence (line 10) can generally be regarded as a threat.
In conclusion, the full scope of the task is complex; it could have a temporal-sequential character, would benefit from determining actors and their interactions, and then should have some sense of severity as well (e.g., distinguishing bullying from teasing).

Our Contributions
Surprisingly, a significant amount of work on the task does not collect (or use) data that allows for the inference of such features (which we will further elaborate on in Section 3). To confirm this, we reproduce part of the previous cyberbullying detection research on different sources. Predictions made by current automatic methods for cyberbullying classification are demonstrated not to reflect the above-described task complexity; we show performance drops across different training domains, and give insights into content feature importance and limitations. Additionally, we report on reproducibility issues in the current state-of-the-art work when subjected to our evaluation. To facilitate future reproduction, we will provide all code open-source, including dataset readers, experimental code, and qualitative analyses. Finally, we present a method to collect crowdsourced cyberbullying data in an experimental setting. It grants control over the size and richness of the data, does not invade privacy, and does not rely on external parties to facilitate data access. Most importantly, we demonstrate that it successfully increases classifier performance. With this work, we provide suggestions on improving methodological rigor and hope to aid the community in a more realistic evaluation and implementation of this task of societal importance.

Related Work
The task of detecting cyberbullying content can be roughly divided into three categories. First, research with a focus on binary classification, where it is only relevant if a message contains bullying or not. Second, more fine-grained approaches where the task is to determine either the role of actors in a bullying scenario or the content type (i.e., different categories of bullying). Both binary and fine-grained approaches predominantly focus on text-based features. Lastly, meta-data approaches that take more than just message content into account; these might include profile, network, or image information. Here, we will discuss efforts relevant to the task of cyberbullying classification within these three topics. We will predominantly focus on work conducted on openly available data, and those that report (positive) F1-scores, to promote fair comparisons. For an extensive literature review and a detailed comparison of different studies, see [50].
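To make concrete why the positive-class F1-score is a fairer basis for comparison than accuracy on heavily skewed data, the following is a minimal sketch. The function names and the toy label vectors are illustrative assumptions and are not taken from any of the cited studies.

```python
def positive_f1(y_true, y_pred, positive=1):
    """F1-score computed for the positive (bullying) class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of correctly labeled instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical skewed sample: 4 bullying messages among 100.
y_true = [1] * 4 + [0] * 96
all_negative = [0] * 100                   # never flags anything
two_hits = [1, 1, 0, 0] + [0] * 95 + [1]   # 2 TP, 2 FN, 1 FP

print(accuracy(y_true, all_negative), positive_f1(y_true, all_negative))
print(accuracy(y_true, two_hits), positive_f1(y_true, two_hits))
```

In this toy case a classifier that never predicts bullying still reaches high accuracy while its positive F1-score is zero, which is why accuracy alone says little on data of this kind.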

Binary Classification
One of the first traceable suggestions for applying text mining specifically to the task of cyberbullying detection is made by [30], who note that [68] previously tried to classify online harassment on the CAW 2.0 dataset. In the latter research, Yin et al. already state that the ratio of documents with harassing content to typical documents is challengingly small. Moreover, they foresee several other critical issues with regard to the task: a lack of positive instances will make detecting characteristic features a difficult task, and human labeling of such a dataset might have to face issues of ambiguity and sarcasm that are hard to assess when messages are taken out of conversation context. Even with very sparse datasets (with less than 1% positive class instances), the harassment classifier outperforms the random baseline using tf·idf, pronoun, curse word, and post similarity features.

Table 2: Overview of datasets used in prior work (partial).

Data by | Also used by | Name | Available | Platform | Pos | Neg | Best by | F1
[67] | [71] | XU TREC | ✓ | Twitter | 684 | 1762 | [71] | .780
[16] | - | DDV MSP | ✗ | Myspace | 311 | 8938 | [16] | .350
[16] | - | DDV YTB | ✗ | YouTube | 449 | 4177 | [16] | .640
[9] | - | BRT TWI | ✓ | Twitter | 220 | 5162 | [9] | .726
[9] | - | BRT TW2 | ✓ | Twitter | 194 | 2599 | [9] | .719
[24] | - | AMI ASK | ✓ | Ask.fm | 3787 | 86419 | [24] | .465
[27] | [12] | HOS INS | ✓ | Instagram | 567 | 1387 | [12] | .783
[58] | [70] | SUI TWI | ✓ | Twitter | 2102 | 5219 | [70] | .719

Following up [68], [47] note that the CAW 2.0 dataset is generally unfit for cyberbullying classification: in addition to lacking bullying labels (it only provides harassment labels), the conversations are predominantly between adults. Their work, along with [4], is a first effort to create datasets for cyberbullying classification through scraping the question-answering website Formspring.me, as well as Myspace. In contrast with similar research, they aim to use textual features while deliberately avoiding Bag-of-Words (BoW) features. Through a curse word dictionary and custom severity annotations, they construct several metrics for features related to these "bad" words. In their more recent paper, [31] redid analyses on the KON FRM set, primarily focusing on the contribution curse words have in the classification of bullying messages. By forming queries from curse word dictionaries, they show that there is no one combination that retrieves all. Moreover, using Essential Dimensions of Latent Semantic Indexing, they show potential for extracting messages containing harmful content, favoring high precision.
More recent efforts include [9], who combined word normalization, Named Entity Recognition to detect person-specific references, and multiple curse word dictionaries [41,10,38] in a rule-based pattern classifier, scoring well on Twitter data. Our own work [24], where we collected a large dataset with posts from Ask.fm, used standard BoW features as a first test. Later, these were extended in [63] with term lists, subjectivity lexicons, and topic model features. Recently popularized techniques of word embeddings and neural networks have been applied by [71,70] on XU TREC, NAY MSP and SUI TWI, both resulting in the highest performance for those sets. Convolutional Neural Networks (CNNs) on phonetic features were applied by [69], and [49] investigate, among others, the same architecture on textual features in combination with Long Short-Term Memory networks (LSTMs). Both [49] and [2] investigate the C-LSTM [72]; the latter includes the Synthetic Minority Over-sampling Technique (SMOTE). However, as we will show in the current research, both of these works suffer from reproducibility issues. Finally, fuzzified vectors of top-k word lists for each class were used to conduct membership likelihood-based classification by [48] on KON FRM, boosting recall over previously used methods.

Fine-Grained Classification
The common denominator of the previously discussed research was a focus on detecting single messages with evidence of cyberbullying per instance. The work of [67] proposes a more fine-grained approach by looking at bullying traces; i.e., the responses to a bullying incident. Their research is split up into a set of tasks on keyword-retrieved (bully) Twitter data: (1) a text classification task, where solely relying on uni+bigram features yielded the best result, (2) a role labeling task, where semantic role labeling was used to distinguish person-mention roles, (3) a sentiment analysis task to determine teasing, where despite high accuracy, 48% of the positive instances were misclassified, and (4) a latent topic modeling task, applying Latent Dirichlet Allocation to their corpus to note that some of the generated topics were relevant to bullying. Lastly, in our work, we demonstrated the difficulty of fine-grained approaches with simple BoW and sentiment features, especially detecting types of cyberbullying [24,64].

Meta-data Features
A notable, yet less popular aspect of this task is the utilization of a graph for visualizing potential bullies and their connections. This method was first adopted by [40], who use this information in combination with a classifier trained on LDA and weighted tf·idf features to detect bullies and victims on the CAW * datasets. Work that more concretely implements techniques from graph theory is that of [56], who used a wide range of features: network features to measure popularity (e.g., degree centrality, closeness centrality), content-based features (length, sentiment, offensive words, second-person pronouns), and incorporated age, gender, and number of comments. They achieved the highest performance on KON FRM and BAY MSP.
Work by [27] focuses on Instagram posts and incorporates platform-specific features retrieved from images and its network. They are the first to adhere to the literature more closely and define cyberaggression [32] separately from cyberbullying, in that these are single negative posts rather than the repeated character of cyberbullying. They also show that certain LIWC (Linguistic Inquiry and Word Count) categories, such as death, appearance, religion, and sexuality, give a good indication of cyberbullying. While BoW features perform best, meta-data features (such as user properties and image content) in combination with textual features from the top 15 comments achieve a similar score. Cyberaggression seems to be slightly easier to classify.

Task Evaluation Importance and Hypotheses
The domain of cyberbullying detection is in its early stages, as can be seen in Table 2. Most datasets are quite small, and only a few have seen repeated experiments. Given the substantial societal importance of improving the methods developed so far, pinpointing shortcomings in the current state of research should assist in creating a robust framework under which to conduct future experiments, particularly concerning evaluating (domain) generalization of the classifiers. To our knowledge, none of the current research seems involved with the latter. This is therefore the main focus of our work. In this section, we define three motivations for assessing this.

Data Scarcity
Considering the complexity of the social dynamics underlying the target of classification, and the costly collection and annotation of training data, the issue of data scarcity can mostly be explained with respect to the aforementioned restrictions on data access: while on a small number of platforms most data is accessible without any internal access (commonly as a result of optional user anonymity), it can be assumed that a significant part of actual bullying takes place 'behind closed doors'. To uncover this, one would require access to all known information within a social network (such as friends, connections, and private messages, including all meta-data). As this is unrealistic in practice, researchers rely on the small subset of publicly accessible data (predominantly text) streams. Consequently, most of the datasets used for cyberbullying detection are small and exhibit an extreme skew between positive and negative messages (as can be seen in Table 3). It is unlikely that these small sets accurately capture the language-use on a given platform, and generalizable linguistic features of the bullying instances even less so. We therefore hypothesize that 1) the samples are underpowered in terms of accurately representing the substantial language variation between platforms, both in normal language-use and bullying-specific language-use.

Task Definition
Furthermore, we argue that this scarcity introduces issues with adherence to the definition of the task of cyberbullying. The chances of capturing the underlying dynamics of cyberbullying (as defined in the literature) are slim with the message-level (i.e., using single documents only) approaches that the majority of work in the field has used up until now. The users in the collected sources have to be rash enough to bully in the open, with particular (curse) word use, which would explain the effectiveness of dictionary and BoW-based approaches in previous research. Hence, we also hypothesize that 2) the positive instances are biased, only reflecting a limited dimension of bullying. A more realistic scenario, where characteristics such as repetitiveness and power imbalance are taken into consideration, would require looking at the interaction between persons, or even profile instances rather than single messages, which, as we argued, is not generally available. The work found in the meta-data category (Section 2.3) supports this argument with improved results using this information.
This theory regarding the definition (or operationalization) of this task is shared by Rosa et al., who pose that "the most representative studies on automatic cyberbullying detection, published from 2011 onward, have conducted isolated online aggression classification" [50, p. 341]. We will mainly focus on the shared notion that this framing is limited to verbal aggression; however, we will empirically assess its overlap with data framed to solely contain online toxicity (i.e., online aggression or cyberaggression) to find concrete evidence.

Domain Influence
Enriching previous work with data such as network structure, interaction statistics, profile information, and time-based analyses might provide fruitful sources for classification and a correct operationalization of the task. However, they are also domain-specific, as not all social media have such a rich interaction structure. Moreover, it is arguably naive to assume that social networks such as Facebook (for which in an ideal case, all aforementioned information sources are available) will stay a dominant platform of communication. Recently, younger age groups have turned towards more direct forms of communication such as WhatsApp, Snapchat, or media-focused forms such as Instagram [55]. This move implies more private and less affluent environments in which data can be accessed (resulting in even more scarcity), and that further development in the field requires a critical evaluation of the current use of the available features, and ways to improve cross-domain generalization overall. This work, therefore, does not disregard textual features; they would still need to be considered as the primary source of information, while paying particular attention to the issues mentioned here. We further try to contribute towards this goal and hypothesize that 3) crowdsourcing bullying content potentially decreases the influence of domain-specific language-use, allows for richer representations, and alleviates data scarcity.

Data
For the current research, we distinguish a large variety of datasets. For those provided through the AMiCA (Automatic Monitoring in Cyberspace Applications) project, the Ask.fm corpus is partially available open-source, and the Crowdsourced corpus will be made available upon request. All other sources are publicly available datasets gathered from previous research as discussed in Section 2. Corpus statistics of all data discussed below can be found in Table 3. The sets' abbreviations, language (EN for English, NL for Dutch), and brief collection characteristics can be found below.

AMiCA
Ask.fm (D_ask, D_ask_nl; EN, NL) was collected from the eponymous social network by [24]. Ask.fm is a question answering-style network where users interact by (frequently anonymously) asking questions on other profiles, and answering questions on theirs. As such, a third party cannot react to these question-answer pairs directly. The anonymity and restrictive interactions make for a high amount of potential cyberbullying. Profiles were retrieved through a profile seed list, used as a starting point for traversing to other profiles and collecting all existing question-answer pairs for those profiles; these are predominantly Dutch and English. Each message was annotated with fine-grained labels (further details can be found in [64]); however, for the current experiments these were binarized, with any form of bullying being labeled positive.
Donated (D_don_nl; NL) contains instances of (Dutch) cyberbullying from a mixture of platforms such as Skype, Facebook, and Ask.fm. The set is quite small; however, it contains several hate pages that are valuable collections of cyberbullying directed towards one person. The data was donated for use in the AMiCA project by previously bullied teens, thus forming a reliable source of gold standard, real-life data.
Crowdsourced (D_sim_nl; NL) originates from a crowdsourcing experiment conducted by [11], wherein 200 adolescents aged 14 to 18 partook in a role-playing experiment on an isolated SocialEngine (www.socialengine.com) social network. Here, each respondent was given the account of a fictitious person and put in one of four roles in a group of six: a bully, a victim, two bystander-assistants, and two bystander-defenders. They were asked to read (and identify with) a character description and respond to an artificially generated initial post attributed to one of the group members. All were confronted with two initial posts containing either low- or high-perceived severity of cyberbullying.

Related Work
Formspring (D_frm; EN) is taken from the research by [47] and is composed of posts from Formspring.me, a question-answering platform similar to Ask.fm. As Formspring is mostly used by teenagers and young adults, and also provides the option to interact anonymously, it is notorious for hosting large amounts of bullying content [7]. The data was annotated through Mechanical Turk, providing a single label by majority vote for a question-answer pair. For our experiments, the question and answer pairs were merged into one document instance.
Myspace (D_msp; EN) was collected by [4]. As this was set up as an information retrieval task, the posts are labeled in batches of ten posts, and thus a single label applies to the entire batch (i.e., does it include cyberbullying). These were merged per batch as one instance and labeled accordingly. Due to this batching, the average number of tokens per instance is much higher than in any of the other corpora.
Twitter (D_twB; EN) by [9] was collected from the stream between 20-10-2012 and 30-12-2012, and was labeled based on a majority vote between three annotators. Excluding re-tweets, the main dataset consists of 220 positive and 5162 negative examples, which adheres to the general expected occurrence rate of 4%. Their comparably-sized test set, consisting of 194 positive and 2699 negative examples, was collected by adding a filter to the stream for messages to contain any of the words school, class, college, and campus. These sets are merged for the current experiments.
Twitter II (D_twX; EN) from [67] focussed on bullying traces, and was thus retrieved by keywords (bully, bullying), which, if left unmasked, generates a strong bias when utilized for classification purposes (both by word use as well as being a mix of toxicity and victims). It does, however, allow for demonstrating the ability to detect bullying-associated topics, and (indirect) reports of bullying.

Experiment-specific
Ask.fm Context (C_ask, C_ask_nl; EN, NL): the Ask.fm corpus was collected on profile level, but prior experiments have focused on single message instances [63]. Here, we aggregate all messages for a single profile, which is then labeled as positive when as few as a single bullying instance occurs on the profile. This aggregation shifts the task of cyberbullying message detection to victim detection on profile level, allowing for more access to context and profile-level severity (such as repeated harassment), and makes for a more balanced set (1,763 positive and 6,245 negative instances).
Formspring Context (C_frm; EN): similar to the Ask.fm corpus, this set was collected on profile level [47]. However, the set only includes 49 profiles, some of which only include a single message. Grouping on full profile level would result in very few instances; thus, we opted for creating small 'contexts' in batches of five (of the same profile). Similar to the Ask.fm approach, if one of these messages contains bullying, the batch is labeled positive, balancing the dataset (565 positive and 756 negative instances).
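The two aggregation schemes described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used to build the corpora: the tuple layout, the function names, joining messages with a space, and keeping a trailing batch that has fewer than five messages are all hypothetical choices.

```python
from collections import defaultdict


def aggregate_by_profile(messages):
    """Merge all messages of a profile into one instance.

    messages: iterable of (profile_id, text, label), label in {0, 1}.
    A profile is positive as soon as one of its messages is positive.
    """
    texts = defaultdict(list)
    labels = defaultdict(int)
    for profile_id, text, label in messages:
        texts[profile_id].append(text)
        labels[profile_id] = max(labels[profile_id], label)
    return [(pid, " ".join(texts[pid]), labels[pid]) for pid in texts]


def batch_by_profile(messages, size=5):
    """Merge messages of the same profile into consecutive batches."""
    grouped = defaultdict(list)
    for profile_id, text, label in messages:
        grouped[profile_id].append((text, label))
    contexts = []
    for pid, items in grouped.items():
        for start in range(0, len(items), size):
            chunk = items[start:start + size]
            merged = " ".join(text for text, _ in chunk)
            positive = int(any(label for _, label in chunk))
            contexts.append((pid, merged, positive))
    return contexts


# Hypothetical toy data.
toy = [("p1", "hi", 0), ("p1", "ur fake", 1), ("p2", "nice pic", 0)]
profiles = aggregate_by_profile(toy)
batches = batch_by_profile(
    [("p1", f"m{i}", int(i == 2)) for i in range(5)], size=2
)
```

Because a single positive message marks the whole profile or batch as positive, the share of positive instances rises after aggregation, which matches the rebalancing noted in the text.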
Toxicity (D_tox; EN) from Kaggle is a Toxic Comment Classification dataset created by Conversation AI [59], which offers over 300k messages from Wikipedia comments with Crowdflower-annotated labels for toxicity (including subtypes). Noteworthy is how disjoint both the task and the platform are from the rest of the corpora used in this research. While toxicity shares many properties with bullying, the focus here is on single instances of insults directed to likely unknown people (to the harasser). Given Wikipedia as a source, the article and moderation focussed comments make it topically quite different from what one would expect on social media; the fundamental overlap is curse words, which is only one of many dimensions to be captured to detect cyberbullying (as opposed to toxicity).

Preprocessing
All texts were tokenized using spaCy [26]. No preprocessing was conducted for the corpus statistics in Table 3. All models (Section 5) applied lowercasing and special character removal only; other preprocessing decreased performance (see Table 6).
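As a minimal sketch of what lowercasing plus special character removal could look like: the exact character set that counts as 'special' is not specified here, so the pattern below (anything that is not a letter, digit, or whitespace) and the choice to replace such characters with a space are assumptions. The sketch uses only the standard library and does not reproduce the spaCy tokenization.

```python
import re

# Assumption: 'special' means anything that is not a letter, digit, or
# whitespace. \w is Unicode-aware, so accented Dutch letters are kept.
_SPECIAL = re.compile(r"[^\w\s]|_")
_SPACES = re.compile(r"\s+")


def preprocess(text):
    """Lowercase, drop special characters, and collapse whitespace."""
    text = text.lower()
    text = _SPECIAL.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


print(preprocess("@V LOL, you DONT have friends.."))
```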

Descriptive Analysis
Both Table 3 and Figure 2 illustrate stark differences, not only across domains but, more importantly, between in-domain training and test sets. Most do not exceed a Jaccard similarity coefficient of 0.20 (Figure 2), implying a large part of their vocabularies do not overlap. This contrast is not necessarily problematic for classification; however, it does hamper learning a general representation for the negative class. It also clearly illustrates how even more disjoint D_twX (collected by trace queries) and D_tox are from the rest of the corpora and splits. Finally, the descriptives (Table 3) further show significant differences in size, message length, class balance, and type/token ratios (i.e., writing level). In conclusion, it can be assumed that the language-use in both positive and negative instances will vary significantly, and that it will be challenging to model in-domain, and to generalize out-of-domain.
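The Jaccard similarity coefficient between two vocabularies is the size of their intersection divided by the size of their union. A small sketch of this computation follows; the whitespace tokenization and the toy documents are assumptions made for illustration and differ from the spaCy tokenization used for the reported figures.

```python
def vocabulary(documents):
    """Set of lowercased, whitespace-separated tokens in a corpus."""
    return {token for doc in documents for token in doc.lower().split()}


def jaccard(vocab_a, vocab_b):
    """|A ∩ B| / |A ∪ B| for two vocabularies."""
    a, b = set(vocab_a), set(vocab_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty vocabularies
    return len(a & b) / len(a | b)


# Hypothetical toy splits.
train_docs = ["you are so fake", "see you tonight"]
test_docs = ["you are a bully", "go outside"]
overlap = jaccard(vocabulary(train_docs), vocabulary(test_docs))
print(overlap)
```

Here the two toy vocabularies share 2 of 10 distinct tokens, giving a coefficient of 0.2, which is the order of magnitude discussed above.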

Experimental Setup
We attempt to address the hypotheses posited in Section 3 and propose five main experiments. Experiments I and III deal with the problem of generalizability, whereas Experiments II and V will both propose a solution for restricted data collection. Experiment IV will reproduce a selection of the current state-of-the-art models for cyberbullying detection and subject them to our cross-domain evaluation, to be compared against our baselines.

Experiment I: Cross-Domain Evaluation
In this experiment, we introduce the cross-domain evaluation framework, which will be extended in all other experiments. For this, we initially perform a many-to-many evaluation of a given model (baseline or otherwise) trained individually on all available data sources, each split into train and test. In later experiments, we extend this with a one-to-many evaluation. This setup implies that (i) we fit our model on the training portion of a given corpus and evaluate prediction performance on the test portions of all available corpora individually (many-to-many). Furthermore, we (ii) fit on the combined training portions of all corpora, and evaluate on all their test portions individually (one-to-many). In sum, we report on 'small' models trained on each corpus individually, as well as a 'large' one trained on them combined, for each test set individually.
For every experiment, hyper-parameter tuning was conducted through an exhaustive grid search, using nested cross-validation (with ten inner and three outer folds) on the training set to find the optimal combination of the given parameters. Any model selection steps were based on the evaluation of the outer folds. The best performing model was then refitted on the full training set (90% of the data) and applied to the test set (10%). All splits (also during cross-validation) were made in a stratified fashion, keeping the label distributions across splits similar to the whole set. Henceforth, all experiments in this section can be assumed to follow this setup.
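The nested, stratified splitting described above can be sketched as follows. This is only an illustration of how the index sets relate, not the implementation used in our framework: the function names are hypothetical, the fold assignment is a deterministic round-robin rather than a shuffled split, and the grid search itself is omitted.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign indices to folds so that every fold mirrors the label ratio."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_label.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)  # round-robin per class
    return folds


def nested_splits(labels, n_outer=3, n_inner=10):
    """Yield (inner_train, inner_val, outer_test) index lists."""
    outer = stratified_folds(labels, n_outer)
    for k, outer_test in enumerate(outer):
        outer_train = [i for j, fold in enumerate(outer) if j != k for i in fold]
        inner_labels = [labels[i] for i in outer_train]
        inner = stratified_folds(inner_labels, n_inner)
        for m, val_positions in enumerate(inner):
            inner_val = [outer_train[p] for p in val_positions]
            inner_train = [outer_train[p]
                           for j, fold in enumerate(inner) if j != m
                           for p in fold]
            yield inner_train, inner_val, outer_test


# Hypothetical training set: 150 instances, 20% positive
labels = [1 if i % 5 == 0 else 0 for i in range(150)]
splits = list(nested_splits(labels))  # 3 outer folds x 10 inner folds
```

Each candidate parameter setting would be fitted on `inner_train` and scored on `inner_val`, with the outer test fold reserved for model selection.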
The many-to-many evaluation framework intends to test Hypothesis 1 (Section 3.1), relating to language variation and cross-domain performance of cyberbullying detection. To facilitate this, we employ an initial baseline model: Scikit-learn's [43] Linear Support Vector Machine (SVM) [15,22] implementation trained on binary BoW features, tuned using the grid shown in Table 3, based on [63]. Given its use in previous research, it should form a strong candidate against which to compare. To ascertain out-of-domain performance compared to this baseline, we report test score averages across all test splits, excluding the set the model was trained on (in-domain).
Consequently, we add an evaluation criterion to that of related work: a model should both perform overall best in-domain and achieve the highest out-of-domain performance on average to classify as a robust method. It should be noted that the selected corpora for this work are not all optimally representative for the task. The tests in our experiments should, therefore, be seen as an initial proposal to improve the task evaluation.
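The out-of-domain average used in this criterion can be sketched as below. The corpus names A, B, and C and all scores are hypothetical placeholders, not results from this paper.

```python
def out_of_domain_average(scores, train_name):
    """Mean test score over all corpora except the one trained on."""
    others = [score for test_name, score in scores[train_name].items()
              if test_name != train_name]
    return sum(others) / len(others)


# Hypothetical positive-class F1 scores, indexed as scores[train][test]
scores = {
    "A": {"A": 0.60, "B": 0.40, "C": 0.50},
    "B": {"A": 0.30, "B": 0.70, "C": 0.20},
    "C": {"A": 0.35, "B": 0.25, "C": 0.55},
}
averages = {name: out_of_domain_average(scores, name) for name in scores}
best_out_of_domain = max(averages, key=averages.get)
```

In this toy example, the model trained on B has the best in-domain score (.70) but the lowest out-of-domain average, which is the kind of discrepancy the stricter criterion is meant to expose.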

Experiment II: Gauging Domain Influence
In an attempt to overcome domain restrictions on language-use, and to further solidify our tests regarding Hypothesis 1, we aim to improve the performance of our baseline models through changing our representations in three distinct ways: (i) merging all available training sets (as to simulate a large, diverse corpus), (ii) aggregating instances on user-level, and (iii) using state-of-the-art language representations over simple BoW features in all settings. We define these experiments as follows.
Volume and Variety
Some corpora used for training are relatively small, and can thus be assumed insufficient to represent held-out data (such as the test sets). One could argue that this can be partially mitigated through simply collecting more data or training on multiple domains. To simulate such a scenario, we merge all available cyberbullying-related training splits (creating D all ), which then corresponds to the one-to-many setting of the evaluation framework. The hope is that corpora similar in size or content (the Twitter sets, Ask.fm and Formspring, YouTube and Myspace) would benefit from having more (related) data available. Additionally, training a large model on its entirety facilitates a catch-all setting for assessing the average cross-domain performance of the full task (i.e., across all test sets when trained on all available corpora). This particular evaluation will be used in Experiment IV (replication) for model comparison.
Context Change
Practically all corpora, save for MySpace and YouTube, have annotations based on short sentences, which is particularly noticeable in Table 3. This one-shot (i.e., based on a single message) method of classifying cyberbullying provides minimal content (and context) to work with. It therefore does not follow the definition of cyberbullying, as previously discussed in Section 3.2. As a preliminary simulation of adding (richer) context, we merge the profiles of D ask and (batches of) D frm into single context instances (creating C ask and C frm , see Section 4). This allows us to compare models trained on larger contexts directly to those trained on single messages, and to evaluate how context restrictions affect performance on the task in general, as well as cross-domain.
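A minimal sketch of merging messages into context instances is given below. The tuple layout, the `batch_size` parameter, and the rule that a context is positive when any of its messages is positive are illustrative assumptions; the actual construction of C ask and C frm is described in Section 4.

```python
from collections import defaultdict


def merge_contexts(messages, batch_size=None):
    """Group (profile_id, text, label) messages into context instances."""
    by_profile = defaultdict(list)
    for profile_id, text, label in messages:
        by_profile[profile_id].append((text, label))
    contexts = []
    for profile_id, items in by_profile.items():
        step = batch_size or len(items)  # no batch size: one context per profile
        for start in range(0, len(items), step):
            batch = items[start:start + step]
            text = " ".join(t for t, _ in batch)
            # Assumption: a context counts as positive if any message in it is
            label = int(any(l for _, l in batch))
            contexts.append((profile_id, text, label))
    return contexts


# Hypothetical messages from two profiles
messages = [
    ("u1", "hi there", 0), ("u1", "you are ugly", 1),
    ("u2", "nice pic", 0), ("u2", "thanks", 0), ("u2", "see you", 0),
]
per_profile = merge_contexts(messages)
per_batch = merge_contexts(messages, batch_size=2)
```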
Improving Representations
Pre-trained word embeddings as language representation have been demonstrated to yield significant performance gains for a multitude of NLP-related tasks [14]. Given the general lack of training data (including negative instances for many corpora), word features (and weightings) trained on the available data tend to be a poor reflection of the language-use on the platform itself, let alone other social media platforms. Therefore, pre-trained semantic representations provide features that, in theory, should perform better in cross-domain settings. We consider two off-the-shelf embedding models per language that are suitable for the task at hand. For English, these are averaged 200-dimensional GloVe [44] vectors trained on Twitter, and DistilBERT [51] sentence embeddings [19]. For Dutch, these are fastText embeddings [8] trained on Wikipedia, and word2vec [36,37] embeddings [60] trained on the COrpora from the Web (COW) corpus [52]. The GloVe, fastText, and word2vec embeddings were processed using Gensim [46].
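The averaged document representation used for the word-level embeddings can be sketched as follows. The toy three-dimensional vectors are hypothetical, and skipping out-of-vocabulary tokens (with a zero vector when no token is known) is an assumption of this sketch rather than a documented detail of our setup.

```python
def average_embedding(tokens, vectors, dim):
    """Average the vectors of all known tokens into one document vector."""
    known = [vectors[t] for t in tokens if t in vectors]  # skip OOV tokens
    if not known:
        return [0.0] * dim  # assumption: zero vector for all-OOV documents
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical 3-dimensional embeddings
vectors = {"you": [1.0, 0.0, 2.0], "rock": [3.0, 2.0, 0.0]}
doc_vector = average_embedding(["you", "rock", "zzz"], vectors, dim=3)
```

The resulting fixed-length vector can then serve as input to a linear classifier such as Logistic Regression.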
As an additional baseline for this section, we include the Naive Bayes Support Vector Machine (NBSVM) from [65], which should offer competitive performance on text classification tasks. This model also served as a baseline for the Kaggle challenge related to D tox . NBSVM uses tf·idf-weighted uni- and bi-gram features as input, with a minimum document frequency of 3, and corpus prevalence of 90%. The idf values are smoothed and tf scaled sublinearly (1 + log(tf)). These are then weighted by their log-count ratios derived from Multinomial Naive Bayes.
Tuning of both embeddings and NB representation classifiers is done using the same grid as Table 3, however replacing C with [1, 2, 3, 4, 5, 10, 25, 50, 100, 200, 500]. Lastly, we opted for Logistic Regression (LR), primarily as this was used in the NBSVM implementation mentioned above, as well as fastText. Moreover, we found SVM using our grid to perform marginally worse using these features. The embeddings were not fine-tuned for the task. While this could potentially increase performance, it complicates direct comparison to our baselines; we leave this for Experiment IV.
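To make the log-count ratio weighting concrete, the sketch below computes r = log((p / ||p||_1) / (q / ||q||_1)), where p and q are the smoothed feature counts of the positive and negative class. It uses raw token counts on toy documents as a simplification; the baseline described above applies the ratios to tf·idf-weighted uni- and bi-grams. The smoothing value `alpha=1.0` and all names are assumptions.

```python
import math


def sublinear_tf(tf):
    """Sublinear term-frequency scaling, 1 + log(tf), for tf > 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def log_count_ratios(docs, labels, alpha=1.0):
    """Naive Bayes log-count ratio per token (positive vs. negative class)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    p = {tok: alpha for tok in vocab}  # smoothed counts, positive class
    q = {tok: alpha for tok in vocab}  # smoothed counts, negative class
    for doc, label in zip(docs, labels):
        target = p if label == 1 else q
        for tok in doc:
            target[tok] += 1
    p_total, q_total = sum(p.values()), sum(q.values())
    return {tok: math.log((p[tok] / p_total) / (q[tok] / q_total))
            for tok in vocab}


# Hypothetical tokenized documents; 1 marks the bullying class
docs = [["you", "idiot"], ["nice", "day"], ["you", "rock"]]
labels = [1, 0, 0]
ratios = log_count_ratios(docs, labels)
```

Features are multiplied element-wise by these ratios before the linear classifier is fitted, so tokens that are relatively more frequent in the positive class receive positive weights and the others negative weights.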

Experiment III: Aggression Overlap
In previous research using fine-grained labels for cyberbullying classification (e.g., [63]), it was observed that cyberbullying classifiers achieve the lowest error rates on blatant cases of aggression (cursing, sexual talk, and threats), an idea that was further adopted by [50]. To empirically test Hypothesis 2 (see Section 3.2), which relates to the bias present in the available positive instances, we adapt the idea of running a profanity baseline from this previous work. However, rather than relying on look-up lists containing profane words, we expand this idea by training a separate classifier on toxicity detection (D tox ) and seeing how well this performs on our bullying corpora (and vice-versa). For the corpora with fine-grained labels, we can further inspect and compare the bullying classes captured by this model.
We argue that a high overlap in test set performance between a toxicity detection model and models trained on cyberbullying detection gives strong evidence that nuanced aspects of cyberbullying are not being captured by such models. Notably, and in line with [50], it would indicate that the current operationalization does not significantly differ from the detection of online aggression (or toxicity), and therefore does not capture actual cyberbullying. Given enough evidence, both issues should be considered as crucial points of improvement for the further development of classifiers in this domain.

Experiment IV: Replicating State-of-the-Art
For this experiment, we include two architectures that achieved state-of-the-art results on cyberbullying detection. As a reference neural network model for language-based tasks, we used a Bidirectional [53,3] Long Short-Term Memory network [25,23] (BiLSTM), partly reproducing the architecture from [2]. We then attempt to reproduce the Convolutional Neural Network (CNN) [29] used in both [49] and [2], and the Convolutional LSTM (C-LSTM) [72] used in [49]. As [49] do not report essential implementation details for these models (batch size, learning rate, number of epochs), there is no reliable way to reproduce their work. We will, therefore, take the implementation of [2] for the BiLSTM and CNN as the initial setup. Given that this work is available open-source, we run the exact architecture (including SMOTE) in our Experiment I and II evaluations. The architecture-specific details are as follows.
Reproduction
We initially adopt the basic implementation by [2]: randomly initialized embeddings with a dimension of 50 (as the paper did not find significant effects of changing the dimension, nor the initialization), run for 10 epochs with a batch size of 128, a dropout probability of 0.25, and a learning rate of 0.01. Further architecture details can be found in our repository. We also run a variant with SMOTE on, and one from the provided notebooks directly. This and the following neural models were run on an NVIDIA Titan X Pascal, using Keras [13] with Tensorflow [1] as backend.
BiLSTM
For our own version of the BiLSTM, we minimally changed the architecture from [2], only tuning using a grid on batch size [32, 64, 128, 256], embedding size [50, 100, 200, 300], and learning rate [0.1, 0.01, 0.05, 0.001, 0.005]. Rather than running for ten epochs, we use a validation split (10% of the train set) and initiate early stopping when the validation loss does not go down after three epochs. Hence, and in contrast to earlier experiments, we do not run the neural models in 10-fold cross-validation, but a straightforward 2-fold train and test split where the latter is 10% of a given corpus. Again, we are predominantly interested in confirming statements made in earlier work; namely, that for this particular setting tuning of the parameters does not meaningfully affect performance.
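The early stopping rule described for the BiLSTM (stop once the validation loss has not improved for three epochs) can be sketched independently of any deep learning framework. The function name and the loss values are hypothetical; this mirrors the general idea of a patience-based callback rather than the exact behavior of a specific library.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0  # improvement: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch  # no improvement for `patience` epochs
    return len(val_losses)  # ran through all epochs without stopping


# Hypothetical validation losses per epoch
losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]
stop_epoch = early_stopping_epoch(losses)
```

Note that in this toy run the loss would have improved again in a later epoch; with a small validation set, such a stopping decision can easily be unrepresentative of test performance.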
CNN
We use the same experimental setup as for the BiLSTM. The implementations of [2,49] use filter window sizes of 3, 4, and 5, max pooled at the end. Given that the same grid is used, the word embedding sizes are varied and weights trained (whereas [49] use 300-dimensional pre-trained embeddings). Therefore, for direct performance comparisons, the results of [2] will be used as a reference. As CNN-based architectures for text classification are often also trained on character level, we include a model variant with this input as well.
C-LSTM
For this architecture, we take an open-source text classification survey implementation. This uses filter windows of [10, 20, 30, 40, 50], 64-dimensional LSTM cells, and a final 128-dimensional dense layer. Please refer to our repository for additional implementation details, for this and previous architectures.

Experiment V: Crowdsourced Data
Following up on the proposed shortcomings of the currently available corpora in Hypotheses 1 and 2, we propose the use of a crowdsourcing approach to data collection. In this experiment, we will repeat Experiments I and II with the best out-of-domain classifier from the above evaluations on three (Dutch) datasets: D ask nl , the Dutch part of the Ask.fm dataset used before; D sim nl , our synthetic, crowdsourced cyberbullying data; and lastly D don nl , a small donated cyberbullying test set with messages from various platforms (a full overview and description of these three can be found in Section 4). The only notable difference to our setup for this experiment is that we never use D don nl as training data. Therefore, rather than D all , the Ask.fm corpus is merged with the crowdsourced cyberbullying data to make up the D comb set.

Results and Discussion
We will now cover results per experiment, and to what extent these provide support for the hypotheses posed in Section 3. As most of these required backward evaluation (e.g., Experiment III was tested on sets from Experiment I), the results of Experiments I-III are compressed in Table 4. Table 6 comprises the Improving Representations part of Experiment II (under 'word2vec' and 'DistilBERT') along with the effect of preprocessing on our baselines. The results of Experiment IV can be found in Table 7. For brevity of reporting, the latter two only report on the in-domain scores, and feature the out-of-domain averages for the D all models for comparison, and D tox averages in Table 7.

Experiment I
Looking at Table 4, the upper group of rows under T1 represents the results for Experiment I. We posed in Hypothesis 1 that samples are underpowered regarding their representation of the language variation between platforms, both for bullying and normal language-use. The data analysis in Section 4.5 showed minimal overlap between domains in vocabulary and notable variances in numerous aspects of the available corpora. Consequently, we raised doubts regarding the ability of models trained on these individual corpora to generalize to other corpora (i.e., domains). Firstly, we consider how well our baseline performed on the in-domain test sets. For half of the corpora, it performs best overall on these specific sets (i.e., the test set portion of the data the model was trained on). More importantly, this entails that for four of the other sets, models trained on other corpora perform equal or better. Particularly the effectiveness of D ask was in some cases surprising; the YouTube corpus by [17] (D ytb ), for example, contains much longer instances (see Table 2).
It must be noted, though, that the baseline was selected from work on the Ask.fm corpus [63]. This data is also one of the more diverse datasets (and the largest) with exclusively short messages; therefore, one could assume a model trained on this data would work well on both longer and shorter instances. It is, however, also likely that particularly this baseline (binary word features) trained on this data therefore enforces the importance of more shallow features. This will be further explored in Experiments II and III.
For Experiment I, however, our goal was to assess the out-of-domain performance of these classifiers, not to maximize performance. For this, we turn to the Avg column in Table 4. Within the top portion of the table, the D ask model performs best across all domains (achieving the highest score on three, as mentioned above). The second-best model is trained on the Formspring data from [47] (D frm ), which is akin to Ask.fm as a domain (question-answer style, option to post anonymously). It can be observed that almost all models perform worst on the 'bullying traces' Twitter corpus by [67], which was collected using queries. This result is relatively unsurprising, given the small vocabulary overlaps with its test set shown in Figure 2. We also confirm, in line with [47], that the CAW data from [4] is unfit as a bullying corpus: achieving significant positive F1-scores with a baseline, generalizing poorly, and proving difficult as a test set. Additionally, we observe that even the best performing models yield between .1 and .2 lower F1 scores on other domains, or a 15-30% drop from the original score. To explain this, we look at how well important features generalize across test sets. As our baseline is a Linear SVM, we can directly extract all grams with positive coefficients (i.e., related to bullying). Figure 4 (right) shows the frequency of the top 5,000 features with the highest coefficient values. These can be observed to follow a Zipfian-like distribution, where the important features most frequently occur in one test set (25.5%) only, which quickly drops off with increasing frequency. Conversely, this implies that over 75% of the top 5,000 features seen during training do not occur in any test instance, and only 3% generalize across all sets. This coverage decreases to roughly 60% and 4% respectively for the top 10,000, providing further evidence of the strong variation in predominantly bullying-specific language-use.
Figure 4 (left) also indicates that the coefficient values are highly unstable across test sets, with most having a standard deviation of roughly 0.4. Note that these coefficient values can also flip to negative for particular sets, so for some of the features, the range goes from being associated with the other class to being highly associated with bullying. Given the results of Table 4 and Figure 4, we can conclude that our baseline model does not generalize well out-of-domain. Given the quantitative and qualitative results reported in this experiment, this particular setting partly supports Hypothesis 1.
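The feature coverage analysis behind Figure 4 (right) can be sketched as counting, for each top-ranked feature, the number of test sets in which it occurs. The feature list and test vocabularies below are hypothetical toy data, and the function name is an assumption of this sketch.

```python
from collections import Counter


def feature_coverage(top_features, test_vocabularies):
    """Count per feature in how many test-set vocabularies it occurs."""
    occurrences = {feature: sum(feature in vocab for vocab in test_vocabularies)
                   for feature in top_features}
    # histogram[k] = number of top features that occur in exactly k test sets
    histogram = Counter(occurrences.values())
    return occurrences, histogram


# Hypothetical top features (highest positive coefficients) and test sets
top_features = ["idiot", "ugly", "loser", "hate"]
test_vocabularies = [
    {"idiot", "hate", "hello"},
    {"idiot", "bye"},
    {"idiot", "hate"},
]
occurrences, histogram = feature_coverage(top_features, test_vocabularies)
```

Here `histogram[0]` gives the number of important training features that never occur at test time, and `histogram[len(test_vocabularies)]` the number that generalize across all sets.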

Experiment II
The results for this experiment can be predominantly found in Table 4 (middle and lower parts, and T2 in particular), and partly in Table 6 (word2vec, DistilBERT). In this experiment, we seek to further test Hypothesis 1 by employing three methods: merging all cyberbullying data to increase volume and variety, aggregating on context level for a context change, and improving representations through pre-trained word embedding features. These are all reasonably straightforward methods that can be employed in an attempt to mitigate data scarcity.

Volume and Variety
The results for this part are listed under D all in Table 4. For all of the following experiments, we now focus on the full results table (including that of Experiment I) and see which individual classifiers generalize best across all test sets (highlighted in gray). The Avg column shows that our 'big' model trained on all available corpora achieves second-best performance on half of the test sets and best on the other half. More importantly, it has the highest average out-of-domain performance, without competition on any test set. These observations imply that for the baseline setting, an ensemble model of different smaller classifiers should not be preferred over the big model. Consequently, it can be concluded that collecting more data does seem to aid the task as a whole.

Table 5: Examples of uni-gram weights according to the baseline SVM trained on D all , tested on D twB and D ask . Words in red are associated with bullying, words in green with neutral content. The color intensity is derived from the strength of the SVM coefficients per feature (most are near zero). Black boxes indicate OOV words. Labels are divided between the gold standard (y) and predicted (ŷ) labels for bullying and neutral content.
Example text: about to leave this school library and take my *ss homeeee bigerrr ? how much ? its gon na touch the sky ? a wonder d*ck ? you p*ss me off so much . r u a r*t*rd liam mate f*ck off @username i will skull drag you across campus . h* of me xoxoxoxoxoxoox
However, a qualitative analysis of the predictions made by this model clearly shows lingering limitations (see Table 5). These three randomly-picked examples give a clear indication of the focus on blatant profanity (such as d*ck, p*ss, and f*ck). Especially combinations of words that in isolation might be associated with bullying content (leave, touch) tend to confuse the model. It also fails to capture more subtle threats (skull drag) and infrequent variations (h*). Both of these structural mistakes could be mitigated by providing more context that potentially includes either more toxicity or more examples of neutral content to decrease the impact of single curse words; hence, the next experiment.
Context Change
As for access to context scopes, we are restricted to the Ask.fm and Formspring data (C frm and C ask in Table 4). Nevertheless, in both cases, we see a noticeable increase for in-domain performance: a positive F1 score of .579 for context scope versus .561 on Ask.fm, and .758 versus .454 on Formspring respectively. This increase implies that, for both individual sets, context-level detection should be preferred over message-level detection. On the other hand, however, these longer contexts do perform worse on out-of-domain sets.
On manual inspection of the feature differences between the other sets, D ask and D frm individually, and C frm and C ask , the scope shift clearly shows in their importances. From a sample of 500 top features occurring in the test set, 63% are profane words. For the models trained on Ask.fm and Formspring this is an average of 42%, and for models trained on both context scopes, it is significantly reduced to 11%. Many important bi-gram features include you; topics such as dating, boys, girls, and girlfriend occur, yet also positive words such as (are) beautiful, the latter of which could indicate messages from friends (defenders). This change is to an extent expected: by changing the scope, the task shifts to classifying profiles that are bullied, thus showing more diverse bullying characteristics. These results provide evidence for extending classification to contexts to be a worthwhile platform-specific setting to pursue. However, we can conversely draw the same conclusions as in Experiment I: that including direct context does not overcome the task's general domain limitations, therefore further supporting Hypothesis 1. A plausible solution to this could be improving upon the BoW features by relying on more general representations of language, as found in word embeddings.

Table 6: Overview of different feature representations (Repr) for Experiments I and II. The '+' parts show performance for preprocessing: removing all special characters (clean), and more sophisticated handling of social media tags and emojis (preproc). Reported are the in-domain positive class F1 scores for Experiment I (T1) and II (T2), and the out-of-domain average (Avg) for D all . Baseline scores are from Table 4.

Improving Representations
The aim for this experiment was to find (out-of-the-box) representations that would improve upon the simple BoW features used in our baseline model (i.e., achieving good in-domain performance as well as out-of-domain generalization). Table 6 lists both of our considered baselines, tested under different preprocessing methods. These are then compared against the two different embedding representations.
For preprocessing, several levels were used: the default for all models being 1) lowercasing only, after which either 2) removal of special characters, or 3) lemmatization and more appropriate handling of special characters (e.g., splitting #word to prepend a hashtag token) was added. The corresponding results in Table 6 do not reveal an unequivocally best preprocessing method for either the BoW baseline or NBSVM. While the latter achieves the highest out-of-domain generalization with thorough preprocessing ('+preproc', .566 positive F1), the baseline model achieves best in-domain performance on five out of nine corpora, and an on-par out-of-domain average (.566 versus .561) with simple cleaning ('+clean').
According to our criterion proposed in Section 5.1, the method with good in- and out-of-domain performance should be preferred. The current consideration of preprocessing methods illustrates how the stricter evaluation criterion used in this experiment potentially yields different overall results in contrast to evaluating in-domain only, or focusing on single corpora. Consequently, we opted for simple cleaning throughout the rest of our experiments (as mentioned in Section 4.4), given its consistent performance for both baselines.
The embeddings do not seem to provide representations that yield an overall improvement for the classification performance of our Logistic Regression model. Surprisingly, however, DistilBERT does yield significant gains over our baseline for the conversation-level corpus of Ask.fm (.629 positive F1 over .579). This might imply that such representations would work well on more (balanced) data, although fine-tuning would be a requirement for drawing strong conclusions. Moreover, given that we restricted our embeddings to averaged representations on document-level for word2vec, and the sentence representation token for BERT, other settings remain unexplored; these are, however, not in scope of the current work. Therefore, we can conclude that no alternative (out-of-the-box) baselines seem to clearly outperform our BoW baseline. We previously alluded to its effectiveness in previous work, and argued this to be a result of capturing blatant profanity. We will further test this in the next experiment.

Experiment III
Here, we investigate Hypothesis 2: the notion that positive instances across all cyberbullying corpora are biased, and only reflect a limited dimension of bullying. We have already found strong evidence for this in the previous Experiments I and II: Figure 4, Table 5, and manual analyses of top features all indicated toxicity-related words to be consistent top-ranking features. To add more empirical evidence to this, we trained models on toxicity, or cyber aggression, and tested them on bullying data (and vice-versa), providing results on the overlap between the tasks. The results for this experiment can be found in the lower end of Table 4, under D tox and T3.
It can be noted that there is a substantial gap between the performance of the cyberbullying classifiers (using D all as reference) on the D tox test set and that of the toxicity model (positive F1 scores of .587 and .806 respectively). More strikingly, however, the other way around, toxicity classifiers perform second-best on the out-of-domain averages (Avg in Table 4). In the context scopes (C frm and C ask ) it is notably close, and for other sets relatively close, to the in-domain performance.
Cyberbullying detection should include detection of toxic content, yet also perform on more complex social phenomena, likely not found in the Wikipedia comments of the toxicity corpus. It is therefore particularly surprising that the toxicity model achieves higher out-of-domain performance on cyberbullying classification than all individual models using BoW features to capture bullying content. Only when all corpora are combined does the D all classifier perform better than the toxicity model. This observation, combined with previous results, provides significant evidence that a large part of the available cyberbullying content is not complex, and that current models only generalize to a limited extent using predominantly simple aggressive features, supporting Hypothesis 2.

Experiment IV
So far, we have attempted to improve a straightforward baseline that was trained on binary features with several different approaches. While changes in data (representations) seem to have a noticeable effect on performance (increasing the amount of messages per instance, merging all corpora), none of the experiments with different feature representations have had an impact. With the current experiment, we had hoped to leverage earlier state-of-the-art architectures by reproducing their methodology and subjecting them to our evaluation framework.
As can be inferred from Table 7, our baselines outperform these neural techniques on almost all in-domain tests, as well as the out-of-domain averages. Having strictly upheld the experimental set-up from [2], and followed that of [49] as closely as possible, we can conclude that, under stricter evaluation, there is sufficient evidence that these models do not provide state-of-the-art results on the task of cyberbullying detection (see footnote 31). Tuning these networks (at least in our set-up) does not seem to improve performance, but rather to decrease it. This indicates that the validation set on which early stopping is conducted is often not representative of the test set. Parameter tuning on this set is consequently sensitive to overfitting; an arguably unsurprising result given the size of the corpora.
Some further noteworthy observations can be made related to the performance of the CNN architecture, achieving quite significant leaps on word level (for D twB ) and character level (for C ask ). Particularly the conversation scopes (C, with a comparatively balanced class distribution) see much more competitive performance compared to the baselines. The same effect can be observed when more data is available; both average test scores for D all and D tox are comparable to the baseline across almost all architectures. Additionally, the D tox scores indicate that all architectures show about the same overlap on toxicity detection, although interestingly, less so for the neural models than for the baselines.

Footnote 31: Upon acquiring the results of the replication of [2] (in particular failing to replicate the effect of the paper's oversampling) we investigated the provided code and notebooks. It is our understanding that oversampling before splitting the dataset into training and test sets causes the increase in performance; we measured overlap of positive instances in these splits and found no unique test instances. Furthermore, after re-running the experiments directly from the notebooks with the oversampling conducted post-split, the effect was significantly decreased (similar to our results in Table 7). The authors were contacted with our observations in March 2019. They unfortunately have not yet confirmed our results. Their repository remains unchanged as of October 2019. Our analyses can be found here: https://github.com/cmry/amica/tree/master/reproduction.
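The oversampling issue described in footnote 31 can be illustrated with a small sketch. Note that it substitutes naive duplication of positive instances for SMOTE, and that the data, function names, and split sizes are all hypothetical; it only demonstrates why oversampling before the train/test split lets positive test instances overlap with the training set.

```python
import random


def oversample(instances, copies):
    """Naive oversampling: repeat every positive instance `copies` times."""
    out = []
    for text, label in instances:
        out.extend([(text, label)] * (copies if label == 1 else 1))
    return out


def split(instances, test_fraction, seed):
    """Shuffle and split into (train, test)."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def leaked_positives(train, test):
    """Positive test texts that also appear in the training set."""
    train_pos = {text for text, label in train if label == 1}
    return {text for text, label in test if label == 1 and text in train_pos}


# Hypothetical corpus: 5 unique positive and 45 unique negative messages
data = [(f"pos{i}", 1) for i in range(5)] + [(f"neg{i}", 0) for i in range(45)]

# (a) Oversample first, then split: copies end up on both sides of the split
train_a, test_a = split(oversample(data, copies=10), 0.1, seed=0)
leak_pre = leaked_positives(train_a, test_a)

# (b) Split first, then oversample the training portion only
train_b, test_b = split(data, 0.1, seed=0)
train_b = oversample(train_b, copies=10)
leak_post = leaked_positives(train_b, test_b)
```

In setting (a), every positive instance in the test split also has a copy in the training split, because each positive has more copies than the test set has items; in setting (b), the test instances remain unseen.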
It can therefore be concluded that the current neural architectures do not provide a solution to the limitations of the task, but rather suffer more in performance. Our experiments do, however, once more illustrate that the proposed techniques of improving the representations of the corpora (by providing more data through merging all sources, and balancing by classifying batches of multiple messages, or conversations) allow the neural models to approach the baseline ballpark. As our goal here was not to completely optimize these architectures, but to replicate them, the proposed techniques could still provide more avenues for further research. Finally, given its robust performance, we will continue to use the baseline model for the next experiment.

Experiment V
Due to the nature of its experimental set-up (which generates balanced data with simple language-use, as shown in Table 2), the crowdsourced data proves easy to classify. Therefore, we do not report out-of-domain averages, as this set would skew them too optimistically, and be uninformative. Regardless, we are primarily interested in performance when crowdsourced data is added, or used as a replacement for real data. In contrast to the other experiments, the focus will mostly be on the Ask.fm (D ask nl ) and donated (D don nl ) scores (see Table 8). The scores on the Dutch part of the Ask.fm corpus are quite similar to those on the English corpus (.561 vs .598 positive F1 score), which is in line with earlier results [63]. Moreover, particularly given the small amount of data, the crowdsourced corpus performs surprisingly well on D ask nl (.516), and significantly better on the donated test data (.667 on D don nl ).
In the settings that utilize context representations, training on conversation scopes initially does not seem to improve detection performance in any of the configurations (save for a marginal gain on D simC nl ). However, it does simplify the task in a meaningful way at test-time; whereas a slight gain is obtained for message-level D ask nl (from .598 F1-score to .608) when merging both datasets, a significant performance boost can be found when training on D comb and testing on D askC nl (from .264 and .501 to .801 on the combined). Hence, it can be concluded that enriching the existing training set with crowdsourced data yields promising improvements.
Based on these results, we confirm that the Experiment II results hold for Dutch: more diverse, larger datasets and increasing context sizes contribute to better performance on the task. Most importantly, there is enough evidence to support Hypothesis 3: the data generated by the crowdsourcing experiment helps detection rates for our in-the-wild test set, and its combination with externally collected data increases performance with and without additional context.

Suggestions for Future Work
We hope our experiments have helped shed light on, and will raise further attention to, multiple issues with methodological rigor pertaining to the task of cyberbullying detection. It is our understanding that the disproportionate amount of work on the (oversimplified) classification task, versus the lack of focus on constructing rich, representative corpora reflecting the actual dynamics of bullying, has made critical assessment of the advances in this task difficult. We would therefore particularly like to stress the importance of simple baselines and the out-of-domain tests that we included in the evaluation criteria for this research. They would provide a fairer comparison for proposed novel classifiers, and a more unified method of evaluation.
Furthermore, novel research would benefit from explicitly finding evidence to support its assumptions that classifiers labeled 'cyberbullying detection' do more than one-shot, message-level toxicity detection. We would argue that the current framing of the majority of work on the task is still too limited to be considered theoretically-defined cyberbullying classification. In our research, we demonstrated several qualitative and quantitative methods that can facilitate such analyses. As the popularity of cyberbullying detection as an application is increasing, this would avoid misrepresenting the conducted work, and that of possible in-the-wild applications in the future.
While we demonstrated a method of collecting plausible cyberbullying with guaranteed consent, the more valuable sources of real-life bullying that allow for complex models of social interaction remain restricted. It is our expectation that future modeling will benefit from the construction of much larger (anonymized) corpora, as most fields dealing with language have, and we therefore hope to see future work heading in this direction.
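To make the out-of-domain criterion concrete, the sketch below computes a positive-class F1 score and an out-of-domain average that excludes the test score on the corpus a model was trained on, following the description of Table 4. The function names and the nested-dictionary layout of the scores are assumptions made for this example and do not reflect the released framework.

```python
from typing import Dict, Sequence


def positive_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 score of the positive class (label 1) only."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def out_of_domain_average(scores: Dict[str, Dict[str, float]],
                          train: str) -> float:
    """Mean test score of the model trained on `train`, skipping the
    test set that belongs to that same (parent) corpus.

    `scores[train][test]` is assumed to hold the positive-class F1 of
    the model fitted on corpus `train` and evaluated on corpus `test`.
    """
    others = [s for test, s in scores[train].items() if test != train]
    if not others:
        raise ValueError("no out-of-domain test sets for " + train)
    return sum(others) / len(others)
```

Reporting this average next to the in-domain score separates how well a model fits its own corpus from how well it transfers to the others.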

Conclusion
In this work, we identified several issues that affect the majority of the current research on cyberbullying detection. As it is difficult to collect accurate cyberbullying data in the wild, the field suffers from data scarcity. In an optimal scenario, rich representations capturing all required meta-data to model the complex social dynamics of what the literature defines as cyberbullying would likely prove fruitful. However, one can assume such access to remain restricted for the time being and, with current social media moving towards private communication, to not be generalizable in the first place. Thus, significant changes need to be made to the empirical practices in this field. To this end, we provided a cross-domain evaluation setup and tested several cyberbullying detection models under a range of different representations, both to potentially overcome the limitations of the available data and to provide a fair, rigorous framework that facilitates direct model comparison for this task.
Additionally, we formed three hypotheses we would expect to find evidence for during these evaluations: 1) the corpora are too small and heterogeneous to accurately represent the strong variation in language use for both bullying and neutral content across platforms, 2) the positive instances are biased, predominantly capturing toxicity and no other dimensions of bullying, and finally 3) crowdsourcing offers a resource for generating plausible cyberbullying events that can help expand the available data and improve the current models.
We found evidence for all three hypotheses: previous cyberbullying models generalize poorly across domains, simple BoW baselines prove difficult to improve upon, there is considerable overlap between toxicity classification and cyberbullying detection, and crowdsourced data yields well-performing cyberbullying detection models. We believe that the results for Hypotheses 1) and 2) in particular are principal hurdles that need to be tackled to advance this field of research. Furthermore, we showed that both leveraging training data from all openly available corpora and shifting representations to include context meaningfully improve performance on the overall task. Therefore, we believe both should be considered as an evaluation point in future work, all the more so given that we show that these do not solve the existing limitations of the currently available corpora, and could therefore provide avenues for future research focusing on collecting (richer) data. Lastly, we showed that reproduction of models that previously demonstrated state-of-the-art performance on this task fails. We hope that the observations and contributions made in this paper can help improve rigor in future cyberbullying detection work.

Figure 1: Role graph of a bullying event. Each vertex represents an actor, labeled by their role in the event. Each edge indicates a stream of communication, labeled by whether this is positive (+) or negative (−) in nature, with its strength indicating the frequency of interaction. The dotted vertices were added by [67] to account for social-media-specific roles.

Figure 3: SVM baseline and NB-SVM grid values used in hyperparameter search.

Figure 4: Left: Top 20 test set words with the highest average coefficient values across all classifiers (minus the model trained on D tox). Error bars represent standard deviation. Each coefficient value is only counted once per test set. The frequency of the words is listed in the annotation. Right: Test set occurrence frequencies (and percentages) of the top 5,000 highest absolute feature coefficient values.

Table 2: Overview of datasets for cyberbullying detection.

Table 3: Corpus statistics for English and Dutch cyberbullying datasets, listing the number of positive (Pos, bullying) and negative (Neg, other) instances, Types (unique words), Tokens (total words), average number of tokens per message (Avg Tok/Msg), number of emojis and emoticons (Emotes), and swear word occurrence (Swears).

Table 4: Cross-corpora positive class F1 scores for Experiment I (T1), II (T2), and III (T3). Models are fitted on the training proportion of the corpora row-wise, and tested column-wise. The out-of-domain average (Avg) excludes test performance of the parent training corpus. The best overall test score is noted in bold, the best out-of-domain performance in gray.

Table 7: Overview of different architectures (Arch) and their in-domain positive class F1 scores for Experiment I (T1) and II (T2), and the out-of-domain averages for D all (all) and D tox (tox). The baseline model (and its scores) is that of Table 4. Reproduction results of [2] are denoted by *, their oversampling method by +. Our tuned model versions have no annotation; character-level models are denoted by ⋆.

Table 8: Positive class F1 scores for Experiment IV on Dutch data. Models are fitted on the training proportion of the corpora row-wise and tested column-wise. The best overall test score is noted in bold. The scores of primary interest are highlighted in gray.