1 Introduction

In research areas such as machine learning, pattern recognition, and data mining, class imbalance is one of the problems that have recently gained the most attention [1]. A dataset with an unequal class distribution is technically imbalanced; the minority class is represented by a very small number of instances in contrast to the majority class.

It is known that cyberbullying can negatively affect people's lives in many ways. Machine learning can be effective in detecting bullies' language patterns and can also produce models that detect cyberbullying actions [2]. Conducting machine learning research in the field of cyberbullying is difficult, not only because of the lack of accurately labeled datasets, but also because all available datasets suffer from class imbalance, where the majority (not bullying) class usually greatly outnumbers the minority (bullying) class [3, 4]. In cyberbullying datasets, the percentage of instances labeled as normal is higher than the percentage labeled as abnormal, which is known in data mining as the class imbalance problem. If the training dataset is imbalanced, a classification algorithm generally predicts the labels of the majority class instances correctly and those of the minority class incorrectly, which is a major problem for cyberbullying detection systems [5, 6].

Class imbalance is a persistent challenge in data mining, particularly in the context of cyberbullying detection systems. Resampling techniques have been proposed as a potential solution to this issue, with data preprocessing considered essential for building effective models using modern data mining algorithms. Among the simplest sampling methods are random undersampling and random oversampling: the former randomly selects majority class samples for deletion, while the latter randomly duplicates minority class samples. However, these techniques have limitations, as undersampling can result in information loss, while oversampling can lead to over-fitting. To overcome these issues, various alternative techniques have been proposed.

Nitesh V. Chawla et al. [7] developed an oversampling technique called the synthetic minority oversampling technique (SMOTE). Experiments with this technique were performed using the C4.5, Ripper, and naive Bayes classification algorithms, and the results showed improved classifier accuracy on the minority class. Because of this success, the algorithm has been used in many areas of data mining. The minority class in a dataset affects the classification accuracy of classification algorithms. A clustering-based undersampling technique was developed by Yen et al. [8] to enhance classification accuracy for the minority class; experimental results demonstrate that it outperforms other undersampling techniques. Classification methods developed by researchers are used in many important areas. To increase classification accuracy on medical datasets, Li et al. [9] used oversampling and undersampling strategies. Liu et al. [10] examined how oversampling and undersampling techniques affect performance and classification accuracy on an imbalanced text dataset. In this study, we investigate the effect of four resampling techniques on classification performance across four cyberbullying datasets.

The paper is organized as follows: Sect. 2 presents the related work, providing a comprehensive overview of the existing literature. In Sect. 3, the background is provided, covering resampling techniques, classification algorithms, and performance measures relevant to the study. The methodology employed in this research is detailed in Sect. 4. Section 5 presents the results obtained from the experiments and discusses the key findings. A thorough discussion of the results is provided in Sect. 6, while Sect. 7 concludes the paper with final remarks, limitations, and future work.

2 Related work

In this section, we look at a few studies that used machine learning methods and resampling strategies to address the problem of imbalanced datasets.

Kubát and Matwin [10] suggested removing noisy and duplicated training data by using a one-sided selection method that reduces the majority class. The SMOTE algorithm was proposed by Chawla et al. [7] to increase the minority class. A benefit of SMOTE is that it reduces the over-fitting problem caused by random oversampling, since synthetic examples are created rather than instances being replicated. Furthermore, no important information is lost; thus, the decision areas become broader and less specialized.

Naseriparsa et al. [11] suggested a new hybrid approach that combines resampling, sample domain filtering, and a genetic search wrapper subset evaluation method to reduce the dimensionality of the Lung-Cancer dataset from the UCI Machine Learning Repository. Khaldy and Kambhampati [12] highlighted the challenge of applying five alternative feature selection methods and illustrated the effect of class imbalance on high-dimensional datasets. Mehmet and Mohammed [13] examined the effects of oversampling and undersampling techniques on medical datasets, using several medical benchmark datasets and well-known classification techniques. Their experimental results show that classifiers learn more efficiently from oversampled datasets and predict patient instances more successfully.

Regarding the study of the imbalance problem in a cyberbullying context, Colton and Hofmann [4] examined whether the performance of a prediction model is affected by resampling strategies. A compromise method was also investigated, in which the positive class is partially oversampled and the negative class is partially undersampled. Sampling based on the most frequently observed features was investigated as well, although it is not strictly a class imbalance solution.

Talpur and O'Sullivan [6, 14] have recently addressed the issue of class imbalance in cyberbullying datasets. For their research, they used the SMOTE oversampling approach. The results revealed that when the SMOTE parameter was enabled, the base classifier's overall performance improved marginally as it handled the imbalanced class distribution.

Table 1 presents an overview of various rebalancing techniques along with their advantages, disadvantages, cost considerations, and impact on classification performance. Six different techniques are discussed, including random oversampling, random undersampling, SMOTE, ADASYN, TOMEK Links, and cost-sensitive learning.

Table 1 An overview of various rebalancing techniques

Table 2 lists several datasets related to cyberbullying, along with some key information about each. The first column lists the study or source of the dataset. The second column specifies the platform from which the data was collected, such as Formspring, Twitter, Instagram, and MySpace. The third column indicates the language of the data, which is mostly English but also includes Dutch and Japanese. The fourth column shows the size of each dataset, which ranges from 1340 to 1,570,000 instances. Some datasets are balanced, meaning that the proportions of positive and negative instances are roughly equal, while others are imbalanced, with a higher proportion of one class compared to the other.

Table 2 Cyberbullying datasets

The fifth column provides information about the balancing of each dataset, specified as a decimal value between 0 and 1. For example, a value of 0.142 in the balancing column means that 14.2% of instances in the dataset belong to the positive class, while the remaining 85.8% belong to the negative class. Table 2 provides useful information for researchers interested in studying cyberbullying and developing machine learning models to detect and prevent it.

Table 3 specifically focuses on previous work done in Arabic cyberbullying detection. The first column lists the study or source of the dataset. The second column specifies the dataset used in the study, which includes Twitter in Arabic and English and Aljazeera.net. The third column indicates the feature representation used in the study, which includes SentiStrength Feature Vector, word embeddings, TF-IDF, and n-gram. The fourth column lists the classifier used in each study, which includes naive Bayes, SVM, KNN, random forest, logistic regression, convolutional neural networks (CNN), and recurrent neural networks (RNN). The fifth column provides the performance metrics of each classifier, such as accuracy (Acc), precision (P), recall (R), and F1 score (F), which are commonly used in machine learning to evaluate the quality of a classifier's predictions. Table 3 thus provides an overview of the different approaches used in Arabic cyberbullying detection studies and their corresponding performance metrics, which can be useful for researchers working in the field.

Table 3 The previous work done in Arabic cyberbullying detection

3 Resampling techniques

This study aims to investigate the impact of four resampling techniques, namely random undersampling, random oversampling, SMOTE, and a hybrid (SMOTE and TOMEK Links), on imbalanced cyberbullying datasets. These techniques can be summarized as follows:

1. SMOTE (Synthetic minority oversampling technique)

    SMOTE is a popular oversampling technique used to address class imbalance in datasets. Class imbalance occurs when one class has significantly fewer instances than the other, leading to biased learning algorithms. SMOTE helps alleviate this issue by generating synthetic examples for the minority class, thereby balancing the class distribution.

    The basic idea behind SMOTE is to create synthetic instances by interpolating between existing minority class instances. Here is how it works:

  • For each minority class instance, SMOTE selects its k-nearest neighbors in the feature space.

  • Synthetic instances are generated by randomly selecting one of the k neighbors and creating a new instance by interpolating between the selected neighbor and the original instance.

  • This process is repeated until the desired level of oversampling is achieved.

    SMOTE effectively increases the number of minority class instances, making the dataset more balanced and improving the performance of learning algorithms. However, it does not address potential overlapping or noisy samples that might exist in the dataset.
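To make the procedure concrete, the following is a minimal sketch of SMOTE using the imbalanced-learn library; the toy dataset generated with make_classification is an illustrative assumption, not one of the study's datasets.

```python
# Minimal SMOTE sketch with imbalanced-learn; the toy dataset below is an
# illustrative assumption, not one of the cyberbullying datasets.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset: roughly 90% majority, 10% minority.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# k_neighbors is the k used when interpolating between minority instances.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("After: ", Counter(y_res))  # classes are now equal in size
```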

2. SMOTE + TOMEK Links:

    SMOTE+TOMEK Links is a hybrid resampling technique that combines the SMOTE oversampling method with the TOMEK Links undersampling technique. The goal of this combination is to not only increase the number of minority class instances but also remove potential noisy samples and enhance the separation between different classes. TOMEK Links are pairs of instances from different classes that are close to each other but considered to be ambiguous or noisy. By removing these instances, TOMEK Links aim to improve the decision boundary between classes. Here is how SMOTE+TOMEK Links works:

    • First, SMOTE is applied to oversample the minority class and generate synthetic instances.

    • Next, TOMEK Links are used to identify pairs of instances with different class labels that are close to each other.

    • For each identified TOMEK Link, the instance from the majority class is removed.

    • The resulting dataset consists of the augmented minority class instances and the remaining majority class instances.

By combining SMOTE and TOMEK Links, this approach helps to both increase the representation of the minority class and address potential noisy samples, resulting in a more balanced and better separated dataset. This, in turn, can lead to improved classification performance and more reliable predictions. SMOTE and SMOTE + TOMEK Links are valuable techniques for handling class imbalance in datasets, and they have proven to be effective in various machine learning applications.
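In imbalanced-learn, this hybrid is available as SMOTETomek; below is a minimal sketch under the same toy-data assumption as above.

```python
# Minimal SMOTE + TOMEK Links sketch; the toy data is an illustrative
# assumption, not one of the study's datasets.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# SMOTETomek first oversamples the minority class with SMOTE, then removes
# the majority-class member of every TOMEK Link it detects.
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```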

3.1 Undersampling techniques

Undersampling techniques attempt to balance the class distribution by randomly deleting examples of the majority class. This is repeated until the majority and minority classes are of equal size. Among the more commonly used and applied undersampling approaches are random undersampling (RUS) and TOMEK Links undersampling [47].

3.2 Oversampling techniques

Oversampling methods either replicate existing instances or create new instances in the minority class. Oversampling strategies include techniques such as random oversampling (ROS) and the synthetic minority oversampling technique (SMOTE) [47].
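Both random resamplers are likewise provided by imbalanced-learn. The following minimal sketch covers the RUS technique of Sect. 3.1 and the ROS technique above, again on an assumed toy dataset:

```python
# Minimal RUS and ROS sketch; the toy data is an illustrative assumption.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# RUS: randomly delete majority-class examples until the classes are equal.
X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)

# ROS: randomly duplicate minority-class examples until the classes are equal.
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X, y)

print("RUS:", Counter(y_rus), "ROS:", Counter(y_ros))
```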

3.3 Hybrid techniques

While oversampling or undersampling techniques can be effective when applied individually to a training dataset, a combination of both techniques can yield a model that better fits the overall results on the transformed dataset. SMOTE is the most popular and widely used oversampling technique and is often paired with one of several undersampling techniques. The following are some of the frequently used and implemented combinations of data sampling methods [47]:

  • The most basic combination is SMOTE with random undersampling, which has been shown to outperform SMOTE alone.

  • SMOTE with TOMEK Links and SMOTE with Edited Nearest Neighbors Rule are used to remove noisy points from both classes at the class boundary, which appears to improve classifier performance on the altered data.
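The first combination above can be expressed as an imbalanced-learn Pipeline that chains the two samplers. In the minimal sketch below, the sampling_strategy values are illustrative assumptions, not settings used in this study:

```python
# Minimal SMOTE + random undersampling sketch; the ratios are assumptions.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

steps = [
    # Oversample the minority class up to 50% of the majority class ...
    ("smote", SMOTE(sampling_strategy=0.5, random_state=42)),
    # ... then undersample the majority class down to a 1:1 ratio.
    ("rus", RandomUnderSampler(sampling_strategy=1.0, random_state=42)),
]
X_res, y_res = Pipeline(steps).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```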

Table 4 summarizes the advantages and disadvantages of the four resampling techniques used in this study [48].

Table 4 Advantages and disadvantages of used resampling techniques

4 Methods

Figure 1 depicts the methodology employed in this study. The process includes dataset selection, data preprocessing, application of machine learning classifiers, and evaluation of performance measures. Details on each of these steps are provided in the following.

Fig. 1 Proposed method for this study

4.1 Datasets

This study is implemented using four imbalanced cyberbullying datasets. These datasets are publicly available at [49], and they differ in size and imbalance ratio. Table 5 gives the original distribution of the data in terms of the source, the size of the dataset, the number of majority and minority instances, and the imbalance ratio (IR).

Table 5 Cyberbullying dataset used

4.2 Dataset preparation

Preprocessing and resampling techniques play a crucial role in data analysis and machine learning tasks. These techniques are employed to preprocess and manipulate the data before feeding it into a learning algorithm, with the aim of improving the quality and reliability of the results. Let us explore these techniques in more detail:

1. Preprocessing techniques:

    Preprocessing involves a series of steps to transform and prepare the data for analysis. Some common preprocessing techniques include:

    • Data Cleaning This involves handling missing data, removing outliers, and dealing with inconsistent or erroneous values. It ensures the data are accurate and reliable.

    • Feature Scaling It is important to scale features to a consistent range to prevent certain features from dominating the learning process. Common scaling methods include standardization (mean of 0 and variance of 1) and normalization (scaling to a specified range).

    • Feature Encoding Categorical variables often need to be converted into numerical representations for machine learning algorithms to process. Techniques like one-hot encoding and label encoding are commonly used for this purpose.

    • Dimensionality Reduction When dealing with high-dimensional data, dimensionality reduction techniques like principal component analysis (PCA) or feature selection methods can be applied to reduce the number of features while retaining important information.

2. Resampling techniques:

    Resampling techniques are used to address class imbalance issues in the dataset, where the number of instances in one class significantly outweighs the number in another class. Some commonly used resampling techniques include:

    • Oversampling This involves increasing the number of instances in the minority class by duplicating or generating synthetic samples. Techniques like SMOTE (synthetic minority oversampling technique) generate synthetic examples by interpolating between existing minority class instances.

    • Undersampling This technique aims to reduce the number of instances in the majority class by randomly selecting a subset of instances. Undersampling can be effective when the majority class has a large number of redundant or similar instances.

    • Hybrid Approaches These techniques combine oversampling and undersampling to achieve a more balanced dataset. For example, one popular approach is SMOTE combined with TOMEK Links, where synthetic samples are generated for the minority class, and TOMEK Links are used to remove noisy samples from both classes.

These preprocessing and resampling techniques are essential for preparing data for effective machine learning model training. By properly handling data cleaning, scaling, encoding, and addressing class imbalance, these techniques contribute to more accurate and reliable predictions, ultimately enhancing the performance of machine learning models.

In data preparation, standard preprocessing steps are implemented. The following steps are performed for each dataset:

1. Remove stop words.

2. Normalization.

3. Stemming.

4. Transform the text data into numerical form by vectorizing it and computing TF-IDF.
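A minimal sketch of these four steps is given below, assuming English-language text, NLTK's stop-word list, and the Porter stemmer; the tools actually applied to each dataset may differ.

```python
# Minimal text-preparation sketch; the stop-word list, stemmer, and example
# sentences are illustrative assumptions.
import re

from nltk.corpus import stopwords      # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def prepare(text: str) -> str:
    # Normalization: lowercase and strip non-letter characters.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Stop-word removal and stemming.
    return " ".join(stemmer.stem(t) for t in text.split() if t not in stop_words)

texts = ["You are SO stupid!!!", "Have a great day :)"]  # toy examples
cleaned = [prepare(t) for t in texts]

# Vectorize the cleaned text and compute TF-IDF weights.
X = TfidfVectorizer().fit_transform(cleaned)
print(X.shape)
```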

4.3 Used resampling techniques

To investigate the impact of resampling techniques, the four methods (random undersampling, random oversampling, SMOTE, and SMOTE+TOMEK) are applied to each dataset, resulting in a balanced dataset that is then used in the classification phase.

4.4 Machine learning classification algorithms

After resampling, each dataset was split into 80% for training and 20% for testing and then passed to the classification phase, in which eight machine learning classifiers were used (multinomial NB, Bernoulli NB, logistic regression, SGD classifier, SVC, linear SVC, decision tree classifier, and random forest classifier). The choice of optimizer depends on factors such as the problem, dataset characteristics, and training requirements. SGD (stochastic gradient descent) is often chosen for its computational efficiency, scalability to large datasets, noise tolerance for better generalization, flexibility in hyperparameter tuning, and suitability for online learning. However, it has limitations and may require careful tuning. Experimenting with different optimizers is recommended to find the optimal choice, considering factors such as network architecture, dataset, and computational resources.
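A minimal sketch of this phase is shown below; the toy data stands in for a resampled cyberbullying dataset, and all classifier hyperparameters are library defaults rather than the study's settings.

```python
# Minimal classification-phase sketch; the toy data and default
# hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X = MinMaxScaler().fit_transform(X)  # multinomial NB needs non-negative input
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# 80/20 split of the resampled data, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42)

classifiers = {
    "Multinomial NB": MultinomialNB(),
    "Bernoulli NB": BernoulliNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SGD classifier": SGDClassifier(random_state=42),
    "SVC": SVC(),
    "Linear SVC": LinearSVC(),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(random_state=42),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: accuracy = {clf.score(X_test, y_test):.3f}")
```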

4.5 Performance measures

Simpler measurements, such as the accuracy score, can be misleading. As a result, we calculate the confusion matrix and use the accuracy, precision, recall, and F1 score metrics for each classifier to assess its performance.
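Continuing the sketch from Sect. 4.4 (clf, X_test, and y_test as defined there), these measures can be obtained with scikit-learn as follows:

```python
# Minimal evaluation sketch; continues the previous sketch, where `clf` is a
# fitted classifier and X_test, y_test form the held-out 20% split.
from sklearn.metrics import classification_report, confusion_matrix

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))   # rows: true class, cols: predicted
print(classification_report(y_test, y_pred, digits=3))  # per-class P, R, F1
```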

5 Experimental results

This section reports results from a selection of experiments on the classification of cyberbullying datasets under different resampling scenarios. All the experimental analyses were implemented using the Python library imbalanced-learn, which is compatible with Scikit-learn, a machine learning module that provides simple and efficient tools for data mining and machine learning. The machine learning classification algorithms used in our investigation are multinomial NB, Bernoulli NB, logistic regression, SGD classifier, SVC, linear SVC, decision tree classifier, and random forest classifier. The performance of the eight algorithms on the four datasets is assessed and compared in order to identify the best classification algorithm and the best resampling technique. The values obtained for each dataset are shown in the following tables. Hulse et al. [50] suggest that the utility of resampling methods depends on several factors, including the ratio between positive and negative examples, other characteristics of the data, and the nature of the classifier.

Table 6 illustrates the effect of the resampling techniques on the performance measures (accuracy, F1 score, recall, and precision) for all classifiers on the first dataset (Kaggle). For the imbalanced dataset, we can see that logistic regression and SVC outperform the other classifiers for all performance measures except F1 score, for which the SGD classifier is the highest. Logistic regression can be competitive in the case of highly unbalanced data [51].

Table 6 Performance evaluation of classifiers with resampling techniques for the Kaggle dataset

Large amounts of data are discarded in random undersampling. This can be extremely troublesome, as the lack of such data can make it more difficult to learn the decision boundary between minority and majority instances, resulting in a loss of classification performance [51, 52]. From Table 6, we notice a decrease in the performance of all classifiers except multinomial NB and the SGD classifier, which achieve the best performance.

For all classifiers, the ROS technique yields higher values for all performance measures than the other methods, and random forest achieves the highest performance among all classifiers. This technique can affect models that seek good splits of the data, such as support vector machines and decision trees [53]. Comparing decision tree performance under this technique with the original and undersampled datasets, we notice an improvement. The main drawback of oversampling is an increased likelihood of over-fitting, since it duplicates minority class instances; a second drawback is increased learning time, as it increases the number of training examples.

The over-fitting problem caused by random oversampling is avoided in SMOTE, as synthetic examples are created rather than instances being replicated; moreover, no useful information is lost. After using SMOTE, we notice that the recall on the minority class increased while a high precision on the majority class was maintained, which is desirable for classification algorithms. For example, Table 7 shows the SVC classification report before and after SMOTE: the recall for the minority class (class1) increased, while the precision for the majority class (class0) increased. Random forest and SVC outperform the other classifiers for all measures.

Table 7 Comparison between recall and precision before and after SMOTE

SMOTE does not take neighboring examples from other classes into consideration when creating synthetic examples, which can increase class overlap and introduce additional noise, so the hybrid (SMOTE with TOMEK) was used. The results showed an improvement in all performance measures over SMOTE alone for most classifiers, as SMOTE with TOMEK removes noisy points along the class boundary from both classes, which seems to improve the performance of classifiers fit on the transformed dataset. The combination was shown to reduce false negatives at the cost of an increase in false positives for a binary classification task. The hybrid method improved recall and lowered the FN/FP ratio for every classifier, indicating improved sensitivity to cyberbullying [54]. SVC outperforms the other classifiers for all measures.

Table 8 illustrates the performance of the resampling techniques for the Twitter 1 dataset. For the imbalanced dataset, one can notice that SVC and linear SVC outperform the other classifiers. For the RUS technique, one can notice a decrease in all performance measures for all classifiers. As mentioned before, this is because large amounts of data are discarded in random undersampling, which can make it more difficult to learn the decision boundary between minority and majority instances, resulting in a loss of classification performance [51, 52].

Table 8 Performance evaluation of classifiers with resampling techniques for the Twitter1 dataset

Unlike RUS, the ROS technique involves no information loss and outperforms RUS; it achieves the highest values of all the methods for every performance measure. SVC achieves the highest performance among all classifiers. As mentioned before, the main drawback of oversampling is an increased likelihood of over-fitting, since it duplicates minority class instances; a second drawback is increased learning time, as it increases the number of training examples. For these reasons, we also use the SMOTE technique.

The results of the SMOTE technique showed that it achieved a high recall on the minority class while maintaining a high precision on the majority class, which is desirable for classification algorithms. The classification report for the SVC before and after SMOTE showed an increase in recall for the minority class and an increase in precision for the majority class. Additionally, some classifiers showed an improvement in precision and F1 score compared to the original dataset. The SVC classifier achieved the highest performance among all classifiers (Table 9).

Table 9 Comparison between recall and precision before and after SMOTE

Regarding the hybrid (SMOTE with TOMEK), as in the previous dataset, the results showed an improvement in all performance measures over SMOTE alone for most classifiers, since SMOTE with TOMEK eliminates noisy points from both classes along the class boundary, which yields better classifier performance on the transformed dataset.

Table 10 illustrates the performance of the resampling techniques for the Twitter 2 dataset. The original imbalanced dataset has 10,971 instances, which is larger than the two previous datasets. Our empirical results, consistent with [55, 56], confirm that the size of the training set and the classification rate are indeed correlated. Although these algorithms perform relatively well with small datasets, all of the classifiers used show a major improvement in performance as the number of cases increases, indicating a more consistent learning method. All classifiers perform well compared to the previous datasets, and random forest achieves the best performance. Regarding the RUS results, although the performance measures decreased compared to the original imbalanced dataset, we notice that precision is higher than recall for most classifiers; higher precision means that an algorithm returns more relevant results than irrelevant ones. Linear SVC, followed by SGD, achieves the best performance.

Table 10 Performance evaluation of classifiers with resampling techniques for the Twitter 2 dataset

As noticed before, ROS achieved the best results among the sampling techniques used: all performance measures for all classifiers increased compared with the original dataset and the RUS technique. Random forest is the best classifier.

Regarding the SMOTE technique, as in the previous datasets, we notice a high recall on the minority class while a high precision on the majority class is maintained, which is desirable for classification algorithms. For example, Table 11 shows the linear SVC classification report before and after SMOTE: the recall for the minority class (class1) increased, while the precision for the majority class (class0) increased.

Table 11 Comparison between recall and precision before and after SMOTE

As observed in the previous datasets, the combination of SMOTE with TOMEK Links leads to improved performance in certain classifiers (namely SGD, linear SVC, and random forest) when compared to using SMOTE alone. This can be attributed to the fact that SMOTE with TOMEK Links removes noisy data points from both classes along the class boundary, leading to better classifier efficiency on the transformed dataset. Among all the classifiers used, random forest exhibits the highest performance.

Table 12 illustrates the performance of the resampling techniques for the YouTube dataset. The original imbalanced dataset has 2745 instances, fewer than the previous three datasets, and its imbalance ratio of 6.53:1 is higher than those of the previous datasets. Hulse et al. [52] suggest that the utility of resampling methods depends on a number of factors, including the ratio between positive and negative examples, other characteristics of the data, and the nature of the classifier. For datasets with a severe imbalance, our experiments, consistent with [57], show that all the resampling techniques improve the recall of the minority class (TP rate) and the precision of the majority class, except for the RUS technique, owing to the information loss caused by reducing the training sample size. See Table 13 (linear SVC as an example).

Table 12 Performance evaluation of classifiers with resampling techniques for the YouTube dataset
Table 13 Recall and precision for both classes for all resampling techniques

Table 13 provides the recall and precision values for different classes (class0 and class1) obtained from the inference of a linear support vector classifier (linear SVC) using different resampling techniques (None, RUS, ROS, SMOTE, SMOTE + TOMEK). The recall measures the ability of a model to correctly identify positive instances, while precision measures the proportion of correctly identified positive instances out of all instances predicted as positive.

The inference results can be summarized as follows:

For class0:

  • None: The model achieved a recall of 0.97 and a precision of 0.89.

  • RUS (random undersampling): The model achieved a recall of 0.61 and a precision of 0.58.

  • ROS (random oversampling): The model achieved a recall of 0.96 and a precision of 1.0.

  • SMOTE (synthetic minority oversampling technique): The model achieved a recall of 0.78 and a precision of 0.92.

  • SMOTE + TOMEK: The model achieved a recall of 0.79 and a precision of 0.92.

For class1:

  • None: The model achieved a recall of 0.13 and a precision of 0.34.

  • RUS: The model achieved a recall of 0.62 and a precision of 0.65.

  • ROS: The model achieved a recall of 1.0 and a precision of 0.96.

  • SMOTE: The model achieved a recall of 0.47 and a precision of 0.22.

  • SMOTE + TOMEK: The model achieved a recall of 0.48 and a precision of 0.23.

These values indicate the performance of the linear SVC model using different resampling techniques. It is important to analyze both recall and precision together to assess the effectiveness of the model in correctly identifying positive instances and minimizing false positives.

Regarding the RUS technique results, one can notice that although it is efficient at classification time, it reduces performance: by decreasing the training data, it can discard potentially useful information that could be important for building rule classifiers. Linear SVC gives the best performance with RUS. As with all of the studied datasets, ROS achieves the best performance, and SVC outperforms the other classifiers, reaching 0.99 for all measures. SMOTE, which mitigates the over-fitting problem caused by random oversampling, and SMOTE + TOMEK were used to obtain better performance from classifiers fit on the transformed dataset; the two appear similar, with little variation.

6 Discussion

The results of this study suggest that the performance of resampling techniques for cyberbullying datasets depends on several factors, including dataset size, class imbalance ratio, and classifier used. The findings also indicate that no single resampling technique consistently outperforms the others. Therefore, selecting the appropriate resampling technique for a given dataset requires careful consideration of these factors.

One important finding of this study is that classifiers trained on balanced data through resampling are more reliable than those trained on unbalanced data. This underscores the importance of addressing class imbalance in cyberbullying datasets to improve classifier performance. Oversampling and undersampling were found to have different effects on training time, with oversampling leading to longer training times and undersampling reducing them. Therefore, the choice between these two techniques may depend on the available computational resources and time constraints.

Another important finding is that all resampling techniques improve the recall of the minority class and the precision of the majority class when the data are extremely imbalanced. This is particularly important in the context of cyberbullying, where detecting the minority class (i.e., cyberbullying attacks) is crucial for effective prevention and intervention.

Resampling techniques were also found to detect more minority data, especially through oversampling, and improve accuracy by reducing the extent of imbalance. However, it is important to note that resampling cannot improve accuracy if the inaccuracy is not related to imbalance. This highlights the importance of carefully evaluating the reasons for inaccuracies in classifier performance before applying resampling techniques.

The hybrid method SMOTE + TOMEK was found to improve recall and reduce the FN/FP ratio for every classifier, indicating improved sensitivity to cyberbullying. The RUS technique, however, although efficient during classification, was found to reduce overall performance by discarding potentially useful information. Linear SVC performed best with RUS, while all studied datasets achieved the best performance with ROS, where SVC outperformed the other classifiers, reaching 0.99 for all measures. SMOTE and SMOTE + TOMEK were found to mitigate the problem of over-fitting caused by random oversampling and produced similar results with small variations.

7 Concluding remarks

In this paper, we conducted an investigation into the impact of four resampling techniques on the performance of eight classifiers for cyberbullying datasets. Our findings revealed that the effectiveness of resampling techniques is influenced by various factors, including dataset size, imbalance ratio, and the specific classifier employed. No single technique consistently outperformed the others. Notably, classifiers trained on balanced data through resampling exhibited greater reliability compared to those trained on unbalanced data. We observed that oversampling increased training time, while undersampling decreased it. In cases of extreme data imbalance, all resampling techniques enhanced the recall of the minority class and the precision of the majority class. Resampling, particularly through oversampling, facilitated the detection of additional minority data, leading to improved accuracy by reducing the extent of imbalance. However, it is important to note that resampling alone cannot enhance accuracy if the inaccuracies are unrelated to class imbalance. Among the resampling techniques, the hybrid method SMOTE + TOMEK displayed notable improvements in recall and a reduced FN/FP ratio across all classifiers, indicating enhanced sensitivity to cyberbullying instances. Although the RUS technique demonstrated effectiveness in classification, it resulted in an overall performance reduction by discarding potentially valuable information. Linear SVC exhibited the best performance when combined with RUS, while all studied datasets achieved optimal results with ROS. Furthermore, SVC outperformed the other classifiers, reaching 0.99 across all measures.

It is essential to acknowledge the limitations of this study, such as the limited exploration of only four resampling techniques and eight classifiers. Other resampling techniques and classifiers may prove effective for cyberbullying datasets. Additionally, our investigation focused solely on binary classification, leaving room for future research to explore multiclass classification for cyberbullying datasets. Furthermore, our study exclusively examined the impact of resampling techniques on cyberbullying datasets, while future work should investigate their effectiveness on other types of datasets with class imbalance. To expand on the research in this field, we propose several avenues for future work. Firstly, the exploration of alternative resampling techniques and classifiers specifically tailored to cyberbullying datasets would be valuable. Secondly, investigating multiclass classification methods could enhance classifier performance for cyberbullying datasets. Thirdly, extending the investigation to other types of datasets with class imbalance, such as medical datasets, would provide broader insights. Additionally, combining multiple resampling techniques could be explored to further improve classifier performance. Lastly, the effectiveness of combining resampling techniques with other approaches, such as feature selection or extraction, for enhancing classifier performance in cyberbullying datasets warrants investigation.