
1 Introduction

Online Social Networks (OSNs), or Social Media Platforms (SMPs) as we know them, have accumulated millions of users worldwide [20]. With the exponential growth in the number of accounts and active users on OSNs, moderating content and account activities has become increasingly difficult. Beyond distinguishing genuine users from malicious ones, moderation must also account for both informational and malicious bots, and OSNs have been plagued by many types of malicious bots in recent years. Twitter is a popular microblogging and social networking service with millions of users worldwide. Twitter account holders can follow other accounts and can have any number of accounts following them. Each account can post status updates, called “tweets”, of up to 280 characters. Twitter has gained popularity through its adoption by numerous influential figures and its regular political coverage, and these features have made it a target of social media bots spreading fake and malicious content online. One of the most prominent examples of bots spreading fake news and malicious misinformation occurred during the 2016 U.S. presidential election, when Russian bots tried to interfere in the election [23]. Since then, Twitter has taken numerous content moderation measures, suspending suspicious accounts that spread misinformation and flagging baseless or questionable tweets.

In the context of OSNs, a social bot, or suspicious user account, is a computer algorithm or script that automatically interacts with other accounts and produces content without human input or intervention. There are different types of bots, but we only consider two scenarios: either a bot is malicious, that is, it violates the Twitter community guidelines, or it is an informational bot that is not involved in malicious activities. Bots can also operate at different levels of automation, such as fully automated bots, partially automated bots, and real user accounts hacked for malicious activities.

Bots and suspicious accounts participate in activities that can seriously harm the integrity of online communities. Numerous previous studies have tackled social bots on Twitter with the help of machine learning techniques, and there are also real-time Twitter bot detection platforms such as BotOrNot [7]. In response to these efforts, bot accounts have started changing their patterns and can now camouflage themselves well enough that previous methods are no longer reliable for identifying them [14]. This paper focuses on identifying these camouflaged bots with the help of Benford’s Law, Machine Learning (ML) classifiers, and statistical analysis.

In Sect. 2, we introduce the background of our work, in particular Benford’s Law, the classification techniques we trained, and the evaluation methods we used. In Sect. 3, we review the various approaches and machine learning techniques used in the past for bot detection on Twitter. In Sect. 4, we go over the methodology, experimental setup, and datasets used to implement our research. In Sect. 5, we discuss the results and observations of our experiments. In Sect. 6, we present our conclusions and explore possible future directions for this research.

2 Background

In this Section, we discuss the background of our work with particular emphasis on Benford’s Law. Moreover, we list the classification techniques that we trained and tested on the statistics obtained from this law, and we briefly introduce the evaluation methods used. For further information, we refer the reader to the corresponding references.

We experimented using Logistic Regression [25], Naïve Bayes [24], Support Vector Machine [13], Random Forest [3], AdaBoost [22], and Multi-layer Perceptron [19], and evaluated the models with the confusion matrix, accuracy, precision, recall, f-measure, and AUC-ROC curve. More details about these evaluation methods can be found in [8]. To validate our classification results, we also applied statistical tests such as Pearson’s chi-squared test [21], the Kolmogorov-Smirnov test [16], and Mean Absolute Deviation (MAD).
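For concreteness, the sketch below shows one way these metrics could be computed with scikit-learn; the `evaluate` helper and its argument names are ours, not part of the original implementation:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """Hypothetical helper (ours): y_true are ground-truth labels
    (0 = human, 1 = bot), y_pred the predicted labels, and y_score
    the predicted bot probabilities used for the ROC curve."""
    return {
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f_measure": f1_score(y_true, y_pred),
        "auc_roc": roc_auc_score(y_true, y_score),
    }
```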

2.1 Benford’s Law

Benford’s Law, or the Newcomb-Benford Law, states that in any naturally occurring sequence of numbers, the First Significant Leading Digit (FSLD) frequencies follow a particular pattern: they are unevenly distributed and decreasing in nature [2, 18]. In 1881, the astronomer Simon Newcomb first observed that the initial pages of the logarithmic tables in the library were dirtier and decaying more rapidly than the latter ones [18]. He concluded that the initial digits are more likely to appear, and thus be used, than the latter digits. About 50 years later, the physicist Frank Benford re-discovered this forgotten phenomenon and later published a paper titled “The Law of Anomalous Numbers” [2]. For experimentation, he examined 20 sets of naturally occurring sequences with more than 20,000 samples in total, which included data from sources such as river areas, population figures, newspapers, addresses, and death rates [2]. All the datasets he tested followed Benford’s Law, whose predicted FSLD frequencies are given by Eq. 1, where P(d) is the predicted frequency of digit d.

$$\begin{aligned} P(d) = \log _{10} \left( 1 + \frac{1}{d} \right) \end{aligned}$$
(1)
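As a quick numerical illustration, the following short Python snippet computes the predicted frequencies of Eq. 1 for all nine digits:

```python
import math

# Predicted FSLD frequency for each leading digit d (Eq. 1).
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"P({d}) = {p:.3f}")
# P(1) = 0.301, P(2) = 0.176, ..., P(9) = 0.046
```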

However, Benford’s Law does not hold for all datasets. There are certain conditions that a dataset must fulfil to satisfy this property [17]. We list these conditions below, comparing each with Twitter account information:

  • All digits from 1 to 9 should occur in the leading position. In our Twitter datasets, all digits from 1 to 9 can be possible FSLDs when we consider following_counts.

  • There should be a greater occurrence of smaller numbers than larger numbers. In our Twitter datasets, smaller numbers are more likely to occur than larger ones when we consider status_counts.

  • The dataset should be natural. Twitter relationships where users follow each other should form organically; however, there are botnets that, when paid for the service, follow a particular user to inflate their followers_counts, and such relationships are not organic.

  • There should be no sequence in the numbers. Every individual Twitter account has a different number of status_counts, following_counts, and followers_counts.

  • No predefined boundaries. Twitter has no maximum or minimum number set for the parameters such as favorite_counts, likes_counts, and status_counts.

  • Different orders of magnitude. Twitter has numbers in tens, hundreds, thousands, and even in millions.

  • Dataset should be large. Twitter has millions of users, hence, a large dataset of users is accessible for research.

Table 1. Benford’s distribution FSLD frequencies [2]

Digit          1      2      3      4      5      6      7      8      9
Frequency  30.1%  17.6%  12.5%   9.7%   7.9%   6.7%   5.8%   5.1%   4.6%

The experimental findings of Prof. Jennifer Golbeck [11, 12] and of Lale Madahali and Margeret Hall [15] have paved the way for the use of Benford’s Law on the first-degree egocentric network of any social media profile. It has been experimentally shown that the first significant leading digits of the friend counts of a social media account follow the Benford’s distribution given in Table 1 [11, 12]. If an account does not follow the Benford’s distribution, it may be a suspicious account or a malicious bot.

3 Related Work

This Section discusses previous work on social bot detection on Twitter and analyzes the performance and drawbacks of different approaches, with and without the application of Benford’s Law. These drawbacks illustrate what complicates the detection of bots on social media platforms.

Twitter was launched in 2006 as a simple SMS-based status-sharing service. However, it quickly grew into a full-fledged communication platform. Most of the previous work tackles the problem of social bot detection with supervised machine learning [6].

An example of machine learning applied to Twitter bot detection is the Botometer service, formerly known as the BotOrNot service. This is a popular publicly available bot detection tool which produces a real-time social bot score for Twitter accounts [7]. The BotOrNot service was released in May 2014, and it has been developed by researchers from Indiana University at Bloomington. The service is based on a supervised machine learning classifier which leverages over 1,000 features of the target account to produce a classification score, or social bot score. According to its algorithm, the higher the score, the more likely it is that the target account is controlled by software. To obtain feedback from this tool, the target account’s 200 most recent tweets and 100 recent mentions from other users are required. Its features can be grouped into six main classes, namely, Network, User, Friends, Temporal, Content, and Sentiment. The classifier has been trained on 15,000 manually verified bots and 16,000 human accounts. The main issue to address is the lack of a standard definition of a social bot. Hence, the labelled datasets used to train the classifier are created by researchers after manual analysis, a process that can introduce bias due to human error. Botometer is accessible through both a web interface and an API endpoint.

The researchers Chu et al. have designed a supervised machine learning classifier to assign a target account to one of three different groups, that is, human, bot, and cyborg [5]. An account classified as human is defined as having no automated activity, whereas an account classified as bot is fully automated. An account with a mix of automated and non-automated activity is classified as a cyborg. Their classifier is based on four components, namely, entropy, machine learning, account properties, and decision maker. The entropy component recognizes automation by detecting periodic tweeting times. The machine learning component is based on a Bayesian classifier that identifies text patterns of social spambots on Twitter. The account properties component analyses account information to differentiate humans from bots. Finally, the decision maker component employs Linear Discriminant Analysis on the features shortlisted by the other three components to make a classification. The researchers collected their data by crawling Twitter through the Twitter API, and found that their dataset consists of 53% human, 36% cyborg, and 11% bot accounts. They used a rather small dataset for training the classifier, and turned a binary classification problem into a multi-class classification problem by introducing cyborgs. As the results in Fig. 1 show, their classifier is effective at telling humans and bots apart but is less confident when distinguishing between human and cyborg accounts, or bot and cyborg accounts.

Fig. 1. Confusion matrix on human, cyborg, and bot classification [5].

In 2015, Dr. Jennifer Golbeck from the University of Maryland, College Park was the first to apply Benford’s Law to data from OSNs [12]. The author experimented with five major OSNs, namely, Facebook, Google Plus, LiveJournal, Pinterest, and Twitter, and discovered that certain features of OSNs, such as the friends’ following_counts, conformed to Benford’s Law, that is, Benford’s Law was applicable to the first-degree egocentric network of a target profile. Figure 2 shows those statistics. Specifically, the research findings on the Twitter dataset indicate that accounts which strongly deviated from Benford’s Law were engaged in malicious or unnatural behavior. The Twitter dataset used for the analysis of Benford’s Law has been made public by the author [10].

Fig. 2. FSLDs for Twitter, Google Plus, Pinterest, Facebook [12].

This discovery from [12] led the author to test the hypothesis that the social connections made by bots are unnatural and tend to violate Benford’s Law [11]. The author re-investigated the previously discovered Russian bot accounts from 2015 and uncovered a larger Russian botnet of 13,609 Twitter accounts, 99.6% of which did not conform to Benford’s Law. This study concluded that the first significant leading digits of a friend’s following_counts can be utilized to identify the anomalous behavior of malicious bots, and that they are a significant feature for differentiating between humans and malicious bots. Unfortunately, the author has not made the Russian botnet dataset used in this research public.

In this research, we continue this line of work by combining the promising results of Benford’s Law applied to Twitter bot detection with machine learning.

4 Implementation

In this Section, we discuss in detail the datasets and the step-by-step pipeline for the implementation of this research, and explain the setup followed to train the machine learning models and apply the statistical techniques. Each part of the implementation was done using multiple Conda virtual environments. Conda is an open-source package and environment management system that runs on many operating systems [1].

4.1 Dataset

Twitter is a global communication network available to the public in real-time. It is used daily by millions of users who generate lots of metadata in the form of short messages, location, @handle, display name, number of followers, number of statuses, number of friends, and more. Twitter metadata can be accessed and retrieved through the official Twitter API. However, Twitter has recently introduced rate limits on the API service, which reduce access to the metadata and slow down data collection and retrieval. Due to such rate limits, our research dataset was built with the help of four publicly available datasets, namely, anonymizedTwitter [10], botometer-feedback [7], cresci-2017 [6], and gilani-17 [9]. Table 2 shows the bot and human label counts for each dataset.

Fig. 3. anonymizedTwitter data samples following Benford’s Law Distribution.

Fig. 4. anonymizedTwitter data samples not following Benford’s Law Distribution.

Table 2. Datasets bot and human label counts 

4.2 Approach

The approach that we implemented is divided into two steps: first, preprocessing each dataset and combining them; then, training and testing multiple classifier models and selecting the best one. An overview of our approach, with the MLP applied, is given in Fig. 5. We collect the following_counts of all the friends of the profile under scrutiny and extract the FSLD frequencies to feed our neural network classifier (a sketch of this extraction is given after Fig. 5). The input to the models is fixed in size and equal to 9, that is, each input is the frequency of a particular first significant leading digit. Before outputting the result, the prediction is compared with the majority vote of our statistical tests. This step is only necessary to prove the efficacy of our technique.

Fig. 5. Overview of our approach.
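A minimal sketch of the feature extraction just described, turning the friends’ following_counts of one profile into the 9-element FSLD frequency vector fed to the classifiers (the function name is ours):

```python
import numpy as np

def fsld_frequencies(following_counts):
    """Return the 9-element FSLD frequency vector for the friends'
    following_counts of one profile (helper name is ours).
    Zero counts carry no leading digit and are skipped."""
    digits = [int(str(c)[0]) for c in following_counts if c > 0]
    counts = np.bincount(digits, minlength=10)[1:].astype(float)
    return counts / counts.sum()

# Example: a profile whose friends' following_counts are listed below.
freqs = fsld_frequencies([120, 87, 1500, 23, 9, 341, 18, 2, 1050])
print(freqs)  # frequencies of leading digits 1..9, summing to 1
```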

4.3 Data Preprocessing

Since the first dataset, anonymizedTwitter, did not have labels, labelling was performed. Specifically, the FSLDs of each of the 21,135 data samples were extracted from their following_counts, and their frequencies were visualized as a histogram against the Benford’s Law distribution, one sample at a time. Exploratory data analysis was performed on each data sample, and a bot or human label was assigned to every sample.

The datasets from [6, 9, 26] were only used to collect the Twitter @handle and the ‘bot’ or ‘human’ label provided by the original authors. Afterwards, the first-degree egocentric network data, that is, the following_counts of each friend of that @handle, was collected with the help of the Twitter API. Once all the first-degree egocentric data from each of the four datasets was available, a new combined dataset was created. This combined dataset only contained the FSLD frequencies of each data sample and a label of 0 for human accounts and 1 for bot profiles. Figure 3 shows the congruity of the anonymizedTwitter dataset samples to the Benford’s Law distribution (legitimate users), while Fig. 4 shows the samples that do not follow such law (possible bots). Figure 6, instead, shows the FSLD frequencies for a few samples (one per row) selected from our combined dataset: three bots (label 1) and two human accounts (label 0). It is possible to notice how the rows with bot label 1 do not follow the Benford’s Law distribution.

Fig. 6. Final dataset with FSLD frequencies and bot label.
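As an illustration of the per-sample visual check described at the start of this subsection, a minimal plotting sketch (the function name and plotting details are ours; it assumes an FSLD frequency vector such as the one produced by the fsld_frequencies helper sketched in Sect. 4.2):

```python
import math
import matplotlib.pyplot as plt

def plot_against_benford(fsld_freqs, title=""):
    """Plot one sample's FSLD frequencies against Benford's Law (Eq. 1),
    roughly analogous to the visual check used during labelling."""
    digits = list(range(1, 10))
    benford = [math.log10(1 + 1 / d) for d in digits]
    plt.bar(digits, fsld_freqs, alpha=0.6, label="sample")
    plt.plot(digits, benford, "ro-", label="Benford's Law")
    plt.xticks(digits)
    plt.xlabel("First significant leading digit")
    plt.ylabel("Frequency")
    plt.title(title)
    plt.legend()
    plt.show()
```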

4.4 Training and Testing Classifiers

Once the preprocessing was completed and the combined, labelled dataset was available, the data was split into train and test sets with a 75:25 ratio. The Synthetic Minority Oversampling Technique (SMOTE) [4] was used to treat the imbalance between our bot and human samples. We trained and tested six supervised machine learning classifiers, namely, Logistic Regression, Naïve Bayes, Support Vector Machine, Random Forest, AdaBoost, and Multi-layer Perceptron. The Random Forest and AdaBoost models gave high accuracy scores, but the best model was the Multi-layer Perceptron. A sketch of this training setup is given below.
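The sketch assumes the combined dataset is stored in a CSV file; the file name, column names, and hyperparameters are our assumptions (scikit-learn defaults), and GaussianNB is used as one concrete Naïve Bayes variant:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hypothetical file layout: nine FSLD frequency columns d1..d9 and a 0/1 bot label.
df = pd.read_csv("combined_fsld_dataset.csv")
X = df[[f"d{i}" for i in range(1, 10)]].values
y = df["bot"].values

# 75:25 train/test split, then SMOTE on the training split only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

# The six classifiers; hyperparameters are illustrative defaults, not tuned values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(probability=True),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "MLP": MLPClassifier(max_iter=1000, random_state=42),
}
for name, model in models.items():
    acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: test accuracy = {acc:.4f}")
```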

5 Results

In this Section, we discuss the results of the experiments performed in our research. The training results for each machine learning model are discussed in detail. The summary of all the experiments’ results is given in Table 3.

Table 3. Naïve Bayes performance 

5.1 Naïve Bayes

The first phase of experiments was the training and testing of the Naïve Bayes classifier. This is a naïve approach, as the model considers all features independently, with no correlation. With all nine features evaluated independently, the Naïve Bayes model obtained a detection accuracy of 95.44%, a precision of 99.02%, a recall of 89.68%, and an f-measure of 94.12%. Figure 7 shows the AUC-ROC curve and Confusion Matrix.

5.2 Logistic Regression

In the second phase of experiments, we trained the Logistic Regression model. With only nine features, we obtained a detection accuracy of 99.11%, a precision of 98.59%, a recall of 99.22%, and an f-measure of 98.91%. Figure 8 shows the AUC-ROC curve and Confusion Matrix.

Fig. 7. AUC-ROC curve (a) and Confusion Matrix (b) of Naïve Bayes

Fig. 8. AUC-ROC curve (a) and Confusion Matrix (b) of Logistic Regression

Fig. 9. AUC-ROC curve (a) and Confusion Matrix (b) of Support Vector Machine

5.3 SVM

In the third phase of experiments, we trained and tested a Support Vector Machine (SVM) model on our dataset. The SVM took more time to train than the previous models, but it is well suited to our dataset, as we have a binary classification problem at hand. We obtained a detection accuracy of 99.82%, a precision of 99.56%, a recall of 100.00%, and an f-measure of 99.78%. Figure 9 shows the AUC-ROC curve and Confusion Matrix.

5.4 Random Forest

In the fourth phase of experiments, we trained and tested a Random Forest classifier, which uses multiple decision trees to attain high accuracy. The model trained faster than the SVM and had better overall performance than all the previous models. We obtained a detection accuracy of 99.91%, a precision of 99.83%, a recall of 99.95%, and an f-measure of 99.89%. Figure 10 shows the AUC-ROC curve and Confusion Matrix.

Fig. 10. AUC-ROC curve (a) and Confusion Matrix (b) of Random Forest

5.5 AdaBoost

In the fifth phase of experiments, to push the accuracy score as high as possible, we trained and tested an Adaptive Boosting (AdaBoost) model. This is another ensemble learning technique, like Random Forest, and it obtained a detection accuracy of 99.93%, a precision of 99.89%, a recall of 99.95%, and an f-measure of 99.92%. Figure 11 shows the AUC-ROC curve and Confusion Matrix.

Fig. 11. AUC-ROC curve (a) and Confusion Matrix (b) of AdaBoost Classifier

5.6 Multi-layer Perceptron

The sixth and final phase of experiments consisted of training and testing a feedforward neural network model, the Multi-layer Perceptron (MLP). The MLP classifier has an input and an output layer, just like the Logistic Regression model, but it also has hidden layers of neurons, which allow it to achieve the best possible results. The MLP model obtained the highest accuracy (near perfect) and the best overall performance. We obtained a detection accuracy of 99.98%, a precision of 100.00%, a recall of 99.95%, and an f-measure of 99.97%. Figure 12 shows the AUC-ROC curve and Confusion Matrix.
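As an illustration, a possible scikit-learn configuration for this model, reusing the hypothetical train/test split from the sketch in Sect. 4.4; the hidden-layer sizes here are illustrative assumptions, not tuned values:

```python
from sklearn.neural_network import MLPClassifier

# A possible MLP configuration; hidden-layer sizes are illustrative assumptions.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                    max_iter=1000, random_state=42)
mlp.fit(X_train, y_train)
print(f"MLP test accuracy: {mlp.score(X_test, y_test):.4f}")
```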

5.7 Latency Analysis of ML Algorithms

We studied the latency involved in training our supervised machine learning models on our combined dataset. As the size of the dataset increases, the training time of our models also tends to increase significantly. We trained each of our models on our training dataset of around 25,000 samples, and measured latency as the training time (in milliseconds) taken by each model. Table 4 shows our latency analysis results.
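A minimal sketch of how such per-model training times could be measured, reusing the hypothetical models dictionary and SMOTE-balanced training split from the sketch in Sect. 4.4:

```python
import time

# Time the fit() call of each classifier on the training split.
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{name}: training time = {elapsed_ms:.1f} ms")
```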

5.8 Statistical Tests Majority Vote

Once the MLP model was trained and selected for its high accuracy, we were ready to test it on new accounts and verify its classification with a majority vote of our three statistical tests. We took four random samples, two bots and two humans, and checked our MLP classifier’s predictions against the majority vote of our statistical tests. Here we test the goodness of fit of the FSLDs to the Benford’s Law distribution. The hypotheses for the tests were formulated as follows:

Null hypothesis (H0): the account’s FSLDs follow Benford’s Law.

Alternative hypothesis (H1): the account’s FSLDs violate Benford’s Law.

If the p-value is less than 0.05, reject H0 (nonconformity); otherwise, accept H0 (conformity).
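Under these hypotheses, a minimal sketch of the majority vote; the helper name, the MAD conformity threshold, and the KS critical-value approximation are our assumptions:

```python
import math
import numpy as np
from scipy.stats import chisquare

# Expected Benford FSLD frequencies for digits 1..9 (Eq. 1).
BENFORD = np.array([math.log10(1 + 1 / d) for d in range(1, 10)])

def benford_majority_vote(digit_counts, alpha=0.05, mad_threshold=0.015):
    """Majority vote of the three tests (helper name and MAD threshold are
    ours). digit_counts: observed counts of leading digits 1..9 for the
    friends of one account. Returns True if H0 (conformity) wins the vote."""
    counts = np.asarray(digit_counts, dtype=float)
    n = counts.sum()
    freqs = counts / n

    # Pearson's chi-squared goodness of fit against Benford's expected counts.
    _, chi_p = chisquare(counts, f_exp=BENFORD * n)
    chi_ok = chi_p >= alpha

    # Kolmogorov-Smirnov: largest CDF gap vs. the 0.05 critical value 1.36/sqrt(n).
    ks_stat = np.max(np.abs(np.cumsum(freqs) - np.cumsum(BENFORD)))
    ks_ok = ks_stat <= 1.36 / math.sqrt(n)

    # Mean Absolute Deviation between observed and expected frequencies.
    mad_ok = np.mean(np.abs(freqs - BENFORD)) <= mad_threshold

    return sum([chi_ok, ks_ok, mad_ok]) >= 2
```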

The majority vote of our statistical tests validates the prediction results of our MLP classifier; hence, we have shown that the MLP classifier trained on our combined dataset can be used to detect social bots on Twitter (Table 5).

Fig. 12. AUC-ROC curve (a) and Confusion Matrix (b) of Multi-layer Perceptron

Table 4. Latency analysis on training dataset 
Table 5. Statistical tests majority vote 

6 Conclusions and Future Works

The proposed technique, which collects the following_counts of all the friends of the profile under scrutiny and extracts the FSLD frequencies to feed our neural network classifier, is effective at separating malicious social bots from human accounts. This is due to the strategic selection of the datasets used as ground truth for training our model. The main goal of our research was to create a ground truth dataset from scratch and to train a neural network model on it, while also validating the results with a majority vote of statistical tests. This research enables us to identify whether a Twitter user is a malicious social bot or a human with a high level of confidence. The overall technique is novel and, as far as the authors are aware, has never been implemented in this fashion.

Any supervised machine learning technique used for bot detection is only as good as the ground truth data used to train it. As social bots rapidly change their patterns and techniques, even the most sophisticated bot detection algorithms fail as their training rules become outdated. Benford’s Law is an unavoidable, naturally occurring phenomenon, and it is prevalent in Twitter’s data. Since malicious bots break away from the natural pattern by synthetically following other social bots and malicious accounts, they tend to unknowingly violate Benford’s Law. Hence, our technique will be able to identify malicious bots and suspicious accounts even as bot behavior patterns keep evolving.

However, there are certain limitations to take into consideration. To analyze any account on Twitter, we need the account to be following at least 100 other accounts; the detection accuracy, in fact, deteriorates if the account has fewer friends. Furthermore, Benford’s Law requires several orders of magnitude and a certain number of samples for the FSLD frequency distribution to emerge.

6.1 Future Works

This research can be extended to create a web browser plug-in with which users would be able to classify Twitter accounts in real-time, without manually gathering all the first-degree egocentric data and feeding it to our model. The browser extension could add graphical and textual information to the Twitter page to warn the user about any suspicious profile encountered during regular activity.

Another extension to this research would be the employment of other techniques to identify if the suspicious account is part of a bigger botnet.

The same technique can also be applied to Facebook datasets, or to other social media platforms, to test whether we can successfully classify bots on such platforms with the help of Benford’s Law.