1 Introduction

Today we live in a world where people actively participate in online social platforms. In one form or another, online social networks are part of our daily lives. Social media platforms provide different types of services, ranging from sharing personal views, collaborating with others, spreading information of interest, and exploring new ideas to discussing real-life events and participating in evolving communities. Every social media network has a unique purpose: for example, Facebook is primarily used to connect with family and friends, LinkedIn to connect with one's professional circle, Instagram to share multimedia content, Pinterest to explore interesting pins from others, and Tumblr to find and follow blogs across various categories [4, 49].

Over the last ten years, research in social media analysis has grown across a wide range of topics: ROI for organizations, prediction of real-life changes influenced by social media, descriptive analysis of real-life events as discussed on online platforms [10], viral marketing, social issues, health issues, natural disasters, emergencies, online surveys, countering fake information, detecting cyberbullying and abusive language, e-learning, online monitoring, and more.

In the research area of social media analysis, Twitter is a very popular choice among researchers because of the simple method of accessing its data through an API. The raw feed from the Twitter API is rich in information, in terms of both tweet content features and user profile features. The real potential of the API is that its data can be used both for real-time analysis and for batch processing of huge volumes of tweets [4, 37].

1.1 Motivation

Noteworthy efforts have been made in the past few years to study the activities of online users and to understand their behavioral patterns across various research domains [29, 35, 36, 38, 44]. Online activities make every user distinguishable from other users, and over time these differences become visible as strong patterns. The behavior patterns, or signature style, of a user are very useful in authentication, identification, and access control applications [19].

User-centric approach

The proposed work is an attempt to predict retweets from the point of view of a single user. A Twitter user is the only entity that takes action using free will. As shown in Fig. 1, a Twitter user receives a huge amount of information from various sources. This information overload has a deep impact on user actions: a user cannot consume the sheer amount of information at the same pace as it arrives. This leads to a situation where a user may take either an active or a passive action on the current piece of information. All actions where a user generates new content are active actions, and all actions where a user does not generate new content are passive actions. Retweeting falls under the category of passive actions because, without adding any new information, a user lets the existing information flow towards its followers in the network.

Fig. 1
figure 1

Twitter user as information processing node

These actions form the basis of user behavior, and all of them are recorded in the user profile. For example, a user profile contains the number of tweets posted by the user (active action) and the number of tweets marked as favorites (passive action). The number of tweets retweeted by a user is not recorded in the profile; hence, retweet prediction becomes a research problem of analyzing all the other actions performed by the user since the account was created.

User data and Twitter datasets

The problem of reproducing a Twitter dataset is a major issue for user behavior analysis. Public datasets release only tweet content features, or sometimes just TweetIDs. Hydrating a dataset from TweetIDs after four years results in a loss of about 30% of the dataset [48]. Moreover, the terms and conditions of the Twitter API do not allow fetching user profiles from TweetIDs. The proposed work is an attempt to provide an alternative way to handle this problem by using public Twitter archives [24, 40].

Fig. 2 shows three layers of features that can be used for retweet prediction. The first layer consists of user features, which are available with every tweet collected through the API; the numerical features can be used as-is, and some features can be computed with basic mathematical operations. The second layer, tweet content features, is partially available in the API, and more features can be created using complex algorithms such as NLP pipelines. The third layer of features is not directly available in the random feed of tweets; these features must be generated using additional methods of data collection, complex algorithms, and various assumptions about the network structure. Recent studies have used various combinations of features from all three layers. However, those methods are not reproducible because user information cannot be shared publicly.

Fig. 2
figure 2

Twitter API and availability of Features

User profiles are the most significant part of user behavior analysis, and they are easily available with every tweet coming from the random feed. Zubiaga et al. [48] found that the most common method of data collection from Twitter is the streaming API. Using only the limited features available in the Twitter API is one way to produce domain-independent, language-independent, general-purpose analysis on very large datasets. Recent studies [24, 48] have found that concerns over user privacy and the restrictions imposed by social media companies on distributing and sharing datasets make it very difficult to reproduce the same dataset for social media analysis.

A study [40] comparing Twitter datasets and Twitter archives suggested that freely available archives should be used as an alternative way to reproduce and distribute datasets. The available archives are collections of the live feed of random tweets captured using the Twitter API. Each tweet contains all the data fields available in the API as a JSON document. The significance of using archives is that they contain the full user profile along with the tweet content features.

1.2 Significance of proposed work

Objectives:

  • To provide a baseline pattern of retweet prediction (using 100 million random tweets) for domain-independent data feed with a minimum feature set and low computation requirement.

  • To propose a method for user behavior research that is reproducible, scalable, and based on a public dataset, without violating the terms and conditions of the Twitter API.

  • To reduce the complexity of social media analysis for big data streams using basic numerical features.

  • To predict retweets for every random user, irrespective of whether the user is a normal user or an influencer/celebrity.

Conditions:

  • The dataset contains a random feed without any specific domain, topic, or other conditions.

  • The proposed feature set is created from features available in Twitter streaming API only.

  • The dataset, containing full user profiles, is freely available for research.

  • The feature set includes only numerical values for fast processing and to reduce the computational complexity of text features.

Outcomes:

  • The user profile features performed better than tweet content features for retweet prediction.

  • The basic numerical features are very useful for real-time user behavior analysis.

  • The proposed feature set requires no preprocessing, which makes it fast and scalable for processing big data streams.

  • The proposed feature set has shown promising results for both regression and classification algorithms.

  • The proposed work is able to predict for every user profile, whether influential or normal.

The rest of the article is organized as follows. Related work on retweet prediction is given in Section 2. Section 3 describes the methodology of the study. The evaluation of the proposed work using machine learning algorithms is presented in Section 4. Section 5 comprises the conclusions and future scope of this study.

2 Related works

To understand user behavior, one interesting research question is why a user shares only a few tweets within the network and not all of them. A probable reason is information overload: it is practically impossible for a user to keep sharing every incoming tweet. Hemsley [25] found that approximately 47% of tweets did not get any retweets [14]. This presents an opportunity to analyze the various factors of user actions in order to predict information sharing behavior.

Recent studies on information sharing have proposed various methods to answer these questions. Studies focused on tweet content have used sentiment analysis, location-based features, NLP techniques, the use of hashtags (#), cashtags ($), URLs, and various text-based statistical features [10, 22, 26, 45]. The text-based approaches demand heavy computational resources and, in some cases, all past tweets of the user [10, 23, 27, 43, 47]. The tradeoff between accuracy and computational resources is the bottleneck when scaling up to big data analysis and real-time analysis of live data streams.

Graph-based approaches are commonly limited to well-defined network boundaries and static assumptions about the growth of the network [8]. In practice, replicating these studies is a very large computational challenge, and it is also very difficult to reproduce the same accuracy every time because the network structure keeps evolving.

Retweet cascade techniques need data for the first k retweets, or a 5–10 minute window of temporal features, for retweet prediction. The problem with this method is that the timestamp and user profile of each retweeter are needed to create a retweet cascade for every single tweet. These approaches are not useful for live feed data, because it is not possible to monitor every single tweet for its upcoming retweets before starting to predict [14, 18, 28–31, 46, 47].

Retweet prediction is a very popular way of understanding the dynamics of information sharing on Twitter. In recent years, various combinations of features have been proposed for more accurate retweet prediction, ranging from simple statistical features to more complex ones, including language-specific NLP features, network structure and centrality-based features, and temporal features built from the first n retweets. There are three main questions for understanding information sharing on Twitter. The first is which tweets will get retweets, and why. The second is what significance the network structure and a user's position in the network have for successful information diffusion. The third is which user will retweet a tweet, and why. Answering these questions requires information about the tweet content, the network structure, the user profile of the author of the source tweet, and the user profiles of the users who will retweet it further.

Hemsley [25] used network structure features to predict the extent of information sharing for political messages and found that users with medium-sized networks are more successful in spreading political information than influential users with large networks. Dinh & Parulian [15] used a cascade model for retweets, quote tweets, and replies on COVID-related tweets. They found that the average cascade length is 4 hours for retweets, 3 days for quote tweets, and 2 days for replies. This pattern indicates that the active actions of users, in the form of quotes and replies, have more impact than the passive action of retweeting. Chen et al. [10] studied information sharing for disaster-related tweets using NLP and network features and found that tweets with neutral and positive sentiment had a larger reach than negative information, which is just the opposite of the finding for political messages. Interestingly, they also found that when a piece of negative information does get a few retweets, it draws more responses than positive posts. Panic and worry about a disaster's impact drive user behavior to share negative information more rapidly.

For handling big data streams, recent studies have proposed some very promising solutions. Murshed et al. [34] proposed a model to calculate the overall accuracy of a Twitter dataset using three different methods, of which Atish's measure outperformed the others. They found that language issues related to spelling, grammar, and the unstructured style of writing make it very challenging to achieve a high level of accuracy. Singh et al. [42] proposed a framework for processing big data with a machine learning approach; the framework showcased fast processing using distributed computing and the ability to scale the performance of machine learning algorithms. Clustering an incoming data stream is very difficult for standard machine learning algorithms, and Arpaci et al. [5] proposed evolutionary clustering of Twitter streams on COVID-related tweets, using more than 43 million tweets as the dataset. Duan et al. [16] proposed SELM (Spark Extreme Learning Machine), an algorithm for multi-class classification of big data on an Apache Spark cluster; it performed better and achieved a higher speedup than traditional ELM (Extreme Learning Machine) algorithms.

Information sharing can be analyzed from three different points of view. The first [10, 14, 20, 26] is to predict whether a tweet will get a retweet or not. The second [35, 36, 38, 44] is why the tweets of some users get more retweets than those of other users. The third [18, 31, 46, 47] is to predict which user will retweet a post, and why. To answer these questions, many recent studies have proposed a large number of new features and claimed better results. However, every study is unique in terms of its dataset, domain, set of assumptions, manually coded features, and nature of findings. Replicating these studies is not suitable for domain-independent, real-time analysis with a standard feature set.

A brief summary of related work categorized by feature set used is given in Table 1.

Table 1 Brief summary of related work

2.1 Challenges for retweet prediction in real time big data analysis

Based on the literature review, the following issues are identified:

  • NLP-based approaches need language-specific libraries and are very hard to scale for language-independent analysis.

  • Network-based approaches need a huge amount of information about the social circle of each user, which is not feasible for a real-time random data feed.

  • Manually coded features do not support real-time analysis of big data streams.

  • User data from recent studies is not available for performance comparison.

The new features proposed in recent studies are listed with a description and an indication of whether they can be extracted using the free Twitter API service. The tweet content features are given in Table 2, and Table 3 shows the features based on the author profile.

Table 2 List of Features based on Tweet Content
Table 3 List of Features based on User (Author) Profile

3 Methodology

Based on the challenges of retweet prediction for big data streams of random tweets, the authors propose a simple, fast, and scalable machine learning approach using the simple numeric features available in the Twitter API. The categories and list of features are shown in Fig. 3. The categorization is based on the information contained in each feature: tweet content features carry information about the tweet text and the counts of user responses, while user profile features carry information about the author of the tweet, including the user's social circle and past actions/activities since the account was created.

Fig. 3
figure 3

Features used in the study

To capture the active and passive participation of a user, the authors propose a new feature, "Author Total Activity", defined as the sum of all tweets posted by a user (active action) and the total tweets liked by the user (passive action). For a given user, the total tweets posted and the total activity take very large values for old accounts and small values for new accounts. Therefore, additional features are introduced that normalize these counts to per-year values by dividing them by the account age in years.

$$ Author\ Total\ Activity= Author\ Tweets\ Count+ Author\ Favorites\ Count $$
(1)
$$ Author\ Tweets\ per\ year=\frac{Author\ Tweets\ Count}{Account\ Age} $$
(2)
$$ Author\ Total\ Activity\ per\ year=\frac{Author\ Tweets\ Count+ Author\ Favorites\ Count}{Account\ Age} $$
(3)
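As a concrete illustration, Eqs. (1)–(3) can be computed directly from the fields of the user object delivered with every tweet. The helper below is a minimal sketch, assuming the standard API field names (`statuses_count`, `favourites_count`, `created_at`) and an account age floored at one day so brand-new accounts do not divide by zero; the example profile is hypothetical.

```python
from datetime import datetime, timezone

def author_activity_features(user, now=None):
    """Compute Eqs. (1)-(3) from a Twitter API user object (illustrative sketch)."""
    now = now or datetime.now(timezone.utc)
    created = datetime.strptime(user["created_at"], "%a %b %d %H:%M:%S %z %Y")
    # account age in years, floored at one day to avoid division by zero
    account_age = max((now - created).days, 1) / 365.25
    total_activity = user["statuses_count"] + user["favourites_count"]   # Eq. (1)
    return {
        "author_total_activity": total_activity,
        "author_tweets_per_year": user["statuses_count"] / account_age,  # Eq. (2)
        "author_total_activity_per_year": total_activity / account_age,  # Eq. (3)
    }

# hypothetical profile: 5,000 tweets and 3,000 likes on a two-year-old account
profile = {"statuses_count": 5000, "favourites_count": 3000,
           "created_at": "Mon Aug 06 10:00:00 +0000 2018"}
feats = author_activity_features(
    profile, now=datetime(2020, 8, 6, 10, 0, tzinfo=timezone.utc))
```

For this profile the total activity is 8,000, and both per-year features come out close to half the raw counts, which is exactly the intended scaling-down for older accounts.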

The methodology is explained step by step in Fig. 4. The first step is to collect tweets from the random feed of the Twitter API. Then, for each tweet, all available features are extracted and categorized into two categories. After that, only the numerical features are selected and the newly proposed features are computed.

Fig. 4
figure 4

Proposed Methodology for Retweet Prediction

The proposed work attempts to predict retweets using only the information available in a single tweet post, without any prior information about the user, the network structure, temporal features, or historical tweets. For each random tweet, retweet prediction raises the following questions:

  • RQ1: How to predict whether a tweet will be retweeted or not?

  • RQ2: How to estimate the exact number of retweets a tweet will get?

  • RQ3: How to categorize tweets into different classes based on estimated ranges of retweet count?

As shown in Fig. 4, the machine learning algorithms used in this study are regression algorithms and classification algorithms.

Algorithm for the generation of feature sets from the Twitter data stream.

figure a
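Since the algorithm itself appears only as a figure, the following sketch illustrates the intended step under stated assumptions: each incoming tweet object is split into a tweet content feature set and an author profile feature set, keeping only numeric values. The field names follow the Twitter API; the exact grouping and the derived text counts are illustrative assumptions, not the paper's verbatim algorithm.

```python
# Illustrative split of one tweet object into the two feature categories.
TWEET_CONTENT_FIELDS = ("retweet_count", "favorite_count")
AUTHOR_PROFILE_FIELDS = ("followers_count", "friends_count", "listed_count",
                         "statuses_count", "favourites_count")

def extract_feature_sets(tweet):
    text = tweet.get("text", "")
    content = {f: tweet.get(f, 0) for f in TWEET_CONTENT_FIELDS}
    # simple numeric features derived from the tweet text itself
    content["char_count"] = len(text)
    content["word_count"] = len(text.split())
    content["hashtag_count"] = text.count("#")
    author = {f: tweet.get("user", {}).get(f, 0) for f in AUTHOR_PROFILE_FIELDS}
    return content, author

tweet = {"text": "hello #world", "retweet_count": 2,
         "user": {"followers_count": 150, "statuses_count": 40}}
content, author = extract_feature_sets(tweet)
```

Missing fields default to zero, so the same two fixed-length numeric vectors come out of every tweet in the stream, which is what makes the downstream processing scalable.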

4 Experimental evaluation and results

4.1 Dataset

The dataset of 100 million random tweets is created from the online Twitter archive of August 2018 [39, 40].

The description of the dataset used in the study is given in Table 4. The skewness and kurtosis, along with the other statistical metrics, will help to reproduce this dataset and to compare it with any other dataset with similar properties. The maximum values for "Tweet char count" and "Tweet emojis count" are very large because Twitter supports the Unicode format for emojis, in which a single emoji can be a combination of multiple characters.

Table 4 Description of 100 Million random Tweets Dataset created from Twitter Archives

4.2 Experimental setup (Fig. 5)

The Twitter data collected using the streaming API is available online as archives in a compressed file format. These compressed files are collections of JSON files that contain the actual raw data as received from the streaming API. The JSON format is a very good option for unstructured, variable-length text data: the size of a tweet object varies with the number of fields. For example, the tweet object of a retweet contains information about both the tweet author and the retweeter, whereas an original tweet object has only the tweet author's information. NoSQL databases are suited to handling variable-length documents with a large number of missing data fields; the MongoDB NoSQL database is used in this study. The distributed computation over 100 million tweets is done on an 8-node Apache Spark cluster, where each node has 16 GB of RAM and a 4-core Intel i5 CPU. The programming is done in Python using the pyspark interface of Apache Spark, with Jupyter Notebook as the IDE.
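As a minimal, standard-library-only sketch of the ingestion step (the study itself loads the archives into MongoDB and processes them with pyspark on the Spark cluster), a compressed archive file can be read as line-delimited JSON, skipping the non-tweet stream messages such as delete notices:

```python
import gzip
import json

def iter_tweets(path):
    """Yield tweet objects from a gzip-compressed, line-delimited JSON archive."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            # the streaming API interleaves delete notices and limit messages
            # with real tweets; real tweets always carry a "user" object
            if "user" in obj:
                yield obj
```

The generator form keeps memory flat regardless of archive size, which matters when the same loop has to cover 100 million tweets.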

Fig. 5
figure 5

Schematic representation of Experimental setup used for the study

4.3 Evaluation metrics

The evaluation metrics used in the study are given in Table 5.

Table 5 List of Evaluation metrics
$$ Precision=\frac{TP\_ retweet\_ count}{TP\_ retweet\_ count+ FP\_ retweet\_ count} $$
(4)
$$ Recall=\frac{TP\_ retweet\_ count}{TP\_ retweet\_ count+ FN\_ retweet\_ count} $$
(5)
$$ F1\ Score=\frac{2\ast Precision\_ retweet\_ count\ast Recall\_ retweet\_ count}{\left( Precision\_ retweet\_ count+ Recall\_ retweet\_ count\right)} $$
(6)
$$ Accuracy=\frac{TP\_ retweet\_ count+ TN\_ retweet\_ count}{TP\_ retweet\_ count+ FP\_ retweet\_ count+ FN\_ retweet\_ count+ TN\_ retweet\_ count} $$
(7)

Where TP: True Positive, FP: False Positive, FN: False Negative, TN: True Negative

$$ Log\ Loss=-\frac{1}{T}{\sum}_{i=1}^T\left[{rc}_i\cdot \log \left(P\left({rc}_i\right)\right)+\left(1-{rc}_i\right)\cdot \log \left(1-P\left({rc}_i\right)\right)\right] $$
(8)

Where T: number of tweets, rci: observed retweet label, P(rci): predicted probability

$$ AUC={\int}_i^jf(z). dz $$
(9)

Where i, j are limits of area, f(z) function of the curve

$$ {R}^2=1-\frac{\sum {\left({rc}_i-{\hat{rc}}_i\right)}^2}{\sum {\left({rc}_i-\overline{rc}\right)}^2} $$
(10)
$$ Mean\ Square\ Error=\frac{\sum_{i=1}^T{\left({rc}_i-{\hat{rc}}_i\right)}^2}{T} $$
(11)
$$ Root\ Mean\ Square\ Error=\sqrt{\frac{\sum_{i=1}^T{\left({rc}_i-{\hat{rc}}_i\right)}^2}{T}} $$
(12)
$$ MedAE\ \left( rc,\hat{rc}\right)= median\left(\left|{rc}_1-{\hat{rc}}_1\right|,\dots, \left|{rc}_T-{\hat{rc}}_T\right|\right) $$
(13)

Where rci: observed value, \( {\hat{rc}}_i \): predicted value, \( \overline{rc} \): mean of all observed values

$$ MAE=\frac{1}{T}\ {\sum}_{i=1}^T\left|{rcp}_i-{rct}_i\right| $$
(14)

Where rcpi: retweet count predicted value, rcti: retweet count true value.
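For the binary label of RQ1, Eqs. (4)–(7) reduce to the usual confusion-matrix computation. The sketch below works through them on hypothetical label vectors (1 meaning "got at least one retweet"):

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy per Eqs. (4)-(7), for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)                          # Eq. (4)
    recall = tp / (tp + fn)                             # Eq. (5)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (6)
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # Eq. (7)
    return precision, recall, f1, accuracy

# hypothetical predictions for five tweets: TP=2, FP=1, FN=1, TN=1
p, r, f1, acc = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

With these counts, precision, recall, and F1 all evaluate to 2/3 and accuracy to 3/5, matching the formulas term by term.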

4.4 Performance evaluation

To answer the three research questions, three feature sets were tested. The first set consists of only tweet content-based features, the second set consists of only author profile-based features and the third set is a proposed combination of both sets. The performance of each feature set is compared for each algorithm.

RQ1: Whether a tweet will be retweeted or not?

RQ1 is a binary-choice question. The reason for choosing binary labels is that, in a random sample of tweets, 45% to 50% of tweets do not get any retweet [25]. The binary label helps to split the tweets into two classes, which reduces the number of tweets for the further analysis of predicting how many retweets a tweet can get. Two algorithms have been used for this task: logistic regression and logistic model trees. The results are given in Figs. 6 and 7 and Tables 6 and 7. All three feature sets were able to predict with very high accuracy, with a small improvement visible in the results from tweet content features to author features to the combined features. The answer to the first research question is yes: it is possible to predict accurately whether a tweet will get a retweet or not.

Fig. 6
figure 6

AUC and PR Curve of Logistic Regression for RQ1. (a) LR: Tweet Content features. (b) LR: Author Profile features. (c) LR: Proposed Combined features

Fig. 7
figure 7

AUC and PR Curve of Logistic Model Trees for RQ1. (a) LMT: Tweet Content features. (b) LMT: Author Profile features. (c) LMT: Proposed Combined features

Table 6 Performance Comparison part 1 for RQ1
Table 7 Performance Comparison part 2 for RQ1

RQ2: Predict the exact retweet count for a tweet.

Regression analysis is performed to determine the exact retweet count for a random tweet. The results from the regression algorithms are given in Fig. 8 and Table 8. They indicate that author features performed better than tweet features, and the combined features gave the best performance of all. The R-squared and RMSE values of every regression algorithm are plotted in Fig. 8. The Random Forest and Decision Tree regressors performed best among them, yet all the algorithms produce poor results. This indicates that these features are not a good choice for answering this research question; hence, the answer to the second research question is that prediction of the exact number of retweets is not possible. These features can be combined with other features in future studies for exploratory analysis.

Fig 8
figure 8

Regression Analysis for RQ2. (a) R-squared value comparison. (b) RMSE value comparison

Table 8 Performance Comparison of Regression Analysis for RQ2

RQ3: Categorize tweets into multi-label classes.

To classify the tweets into classes based on ranges of retweet count, different classification algorithms were used, and the performance of the three feature sets was tested for different numbers of bins. The binning criteria are given in Table 9. The results are given in Tables 10, 11, 12, 13 and 14.

Table 9 The binning criteria for classification
Table 10 Performance Metrics for Decision Tree Classification
Table 11 Performance Metrics for Random Forest Classification
Table 12 Performance Metrics for Gradient Boosted Tree Classification
Table 13 Performance Metrics for SVM Classification
Table 14 Performance Metrics for KNN Classification
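Since the exact range boundaries of Table 9 are not reproduced here, the sketch below uses assumed power-of-ten class edges purely to illustrate how a retweet count is mapped to one of seven class labels; both the edges and the helper name are hypothetical.

```python
import bisect

# assumed class boundaries (the real criteria are those of Table 9)
BIN_EDGES = [1, 10, 100, 1_000, 10_000, 100_000]

def retweet_class(retweet_count, edges=BIN_EDGES):
    """Map a retweet count to a class label 0..len(edges); class 0 = no retweets."""
    return bisect.bisect_right(edges, retweet_count)
```

With these six edges the labels 0–6 give seven classes, matching the largest setting (bins = 7) evaluated here, and class 0 holds exactly the tweets with no retweets.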

The values of precision, recall, F1-score, and accuracy are plotted. In terms of accuracy, all three feature sets score above 0.8 when the number of classes is less than 4. As the number of bins/classes increases beyond that, a steady decline in performance is visible. At the highest number of bins (bins = 7), tweet features score below 0.6 in accuracy, whereas author features and the proposed combined features score above 0.6.

The F1-score is plotted for all the classification algorithms and all values of bins; the results are shown in Fig. 9. The best performing algorithm is Random Forest, with an F1-score always greater than 0.7 for author features and combined features. After that, Gradient Boosted Tree performed better than the Decision Tree classifier.

Fig. 9
figure 9

Performance comparison of Classification algorithms for RQ3. (a) Decision Tree. (b) Gradient Boosted Tree. (c) Random Forest. (d) SVM. (e) KNN

An interesting observation is that as the number of classes increases, author features perform very close to the combined features. This pattern suggests that for a large number of classes/bins, the author features can be used instead of the combined features, which reduces the total number of features required and also the complexity of the system. The classification algorithms have shown promising results. The answer to the third research question is that it is possible to categorize tweets into different classes; however, the tradeoff between accuracy and the number of classes should be considered, as expressed in Eq. 15.

$$ Accuracy\propto \frac{1}{Number\ of\ Bins(Classes)} $$
(15)

4.5 Comparison with other works

The comparison of the proposed work with other works is given in Table 15. The highlight of the proposed work is that the proposed feature set has low implementation complexity.

Table 15 Comparison with recent works on information sharing techniques

5 Conclusions and future work

In this paper, an attempt is made to understand the point of view of a user as an information processing node and the role of user profiles on Twitter in predicting retweets. The criteria of using only the Twitter API as the data source and a small number of features provide a unique way of looking at the problem of retweet prediction. The Twitter API is the most common method of data collection from Twitter, which makes it a natural choice for creating reproducible research.

Manually coded features, or new features created using complex algorithms, reduce the chances of scaling up and of replication in other scenarios. A recent study [10] found that positive sentiment results in more retweets during natural disasters, whereas previous studies [29] found that negative sentiment increased retweets during election campaigns. In two different domains, the same feature produced opposite outcomes. This is an example of why some complex features are not suitable for domain-independent, very large-scale, fast data processing.

The contribution of this paper is the effort of reducing the complexity and computational requirements of big data analysis of social media data. Using only numerical features is a very fast, scalable, and feasible solution. Two of the three types of features related to retweet prediction are available in the Twitter API, and of these, author features proved more significant than tweet content features. The combination of both produced the best results.

Three new features, "Author Total Activity", "Author Total Activity per Year", and "Author Tweets per Year", are easy to compute and useful in capturing the active and passive participation of a user. The ability to scale down spikes in the total activity value is achieved by dividing by the account age in years; the same method scales down spikes in the count of tweets posted. This averaging of the total activity count and the total tweet count over the account age is very useful for users who are not regularly active, and it provides the ability to predict for a random user who is not an influencer or a celebrity. Most state-of-the-art research gives more importance to influential users, but in real-time data analysis every tweet is important and every user profile is useful for accurate prediction of retweets. The proposed features provide better results for every type of user. They also provide an important insight for categorizing accounts as more or less trustworthy, forming a basis for distinguishing genuine users from non-genuine-looking accounts.

The proposed method of retweet prediction can easily predict whether a tweet will be retweeted or not. Predicting the exact number of retweets is not achievable with these assumptions and feature sets, but they can be used with other features to reduce the margin of error. Classifying tweets based on retweet count is possible; however, it is difficult to predict accurately with a large number of classes. Fine-grained classes come with the drawback of poor accuracy, while a small number of classes yields high accuracy but with a very large range under one label, which is practically not useful for multi-class classification.

In future work, the proposed feature sets will be applied to the categorization of user accounts based on activities, the role of a user account as a hub or part of the crowd [17], and the impact of information overload on social media users. The proposed three features will be used for categorizing fake and genuine accounts [32] based on their profile features, and the proposed profile features will also be used for opinion mining, sentiment analysis, and fake account detection.