
Role of twitter user profile features in retweet prediction for big data streams


The study of factors that influence information sharing on Twitter is a very active research area. This paper explores the impact of numerical features extracted from user profiles on retweet prediction from the real-time raw feed of tweets. The originality of this work lies in the fact that the proposed model is based on simple numerical features with low computational complexity, which makes it a scalable solution for big data analysis. This research work proposes three new features from the tweet author profile to capture the unique behavioral pattern of the user, namely “Author total activity”, “Author total activity per year”, and “Author tweets per year”. The feature set is tested on a dataset of 100 million random tweets collected through the Twitter API. The binary-label regression gave an accuracy of 0.98 for user-profile features and 0.99 when these were combined with tweet content features. The regression analysis to predict the retweet count gave an R-squared value of 0.98 with combined features. The multi-label classification gave an accuracy of 0.9 for combined features and 0.89 for user-profile features. The user profile features performed better than tweet content features, and the combination performed better still. This model is suitable for near real-time analysis of live streaming data coming through the Twitter API and provides a baseline pattern of user behavior based on numerical features available from user profiles only.


Today we live in a world where people actively participate in online platforms of social interaction. In one form or another, online social networks are part of our daily lives. Social media platforms provide different types of services, ranging from sharing personal views, collaborating with others, spreading information of interest, exploring new ideas, and discussing real-life events to participating in evolving communities. Every social media network has a distinct purpose: for example, Facebook is primarily used to connect with family and friends, LinkedIn is used to connect with people from the professional circle, Instagram is used to share multimedia content, Pinterest is used to explore interesting pins of others, and Tumblr is used to find and follow blogs from various categories [4, 49].

In the last 10 years, social media analysis has shown a growth in research studies ranging from ROI for organizations, prediction of real-life changes influenced by social media, descriptive analysis of real-life events as discussed on online platforms [10], viral marketing, social issues, health issues, natural disasters, emergencies, online surveys, countering fake information, detecting cyber bullying and use of abusive language, e-learning, online monitoring, etc.

In the research area of social media analysis, Twitter is a very popular choice among researchers because its data is easy to access through an API. The raw feed from the Twitter API is very rich in information, in terms of both tweet content features and user profile features. The real potential of getting data from the API is that it can be used for real-time data analysis and also for batch processing of a huge amount of data [4, 37].


Noteworthy efforts have been made in the past few years to study the activities of online users and to understand their behavioral patterns in various research domains [29, 35, 36, 38, 44]. Online activities make every user different from other users, and these differences become visible as strong patterns over time. The behavior pattern, or signature style, of a user is very useful in authentication, identification, and access control applications [19].

User centric approach

The proposed work is an attempt to predict retweets from the point of view of a single user. A Twitter user is the only entity that takes action of its own free will. As shown in Fig. 1, a Twitter user receives a huge amount of information from various sources. This information overload has a very deep impact on user actions. A user cannot consume the sheer amount of information at the same pace as it arrives. This leads to a situation where a user may take either an active action or a passive action on the current piece of information. All the actions where a user generates some new content fall under active actions, and all the actions where a user does not generate new content fall under passive actions. Retweeting is a passive action because, without adding any new information, the user lets the existing information flow towards their followers in the network.

Fig. 1 Twitter user as information processing node

These actions form the basis of user behavior, and all of them are recorded in the user profile. For example, a user profile contains information about how many tweets have been posted by a user (active action) and how many tweets are marked as favorites (passive action). The number of tweets retweeted by a user is not recorded in the profile; hence, retweet prediction becomes a research problem of analyzing all the other actions performed by the user since the user account was created.

User data and twitter dataset

Reproducing a Twitter dataset is a major issue for user behavior analysis. Public datasets release only tweet content features or sometimes just TweetIDs. Hydrating a dataset from TweetIDs after 4 years results in a loss of 30% of the dataset [48]. The terms and conditions of the Twitter API do not allow fetching user profiles from TweetIDs. The proposed work is an attempt to provide an alternative way to handle this problem by using public Twitter archives [24, 40].

Fig. 2 shows three layers of features which can be used for retweet prediction. The first layer consists of user features, which are available with every tweet collected using the API. The numerical features can be used as they are, and some features can be computed with basic mathematical operations. The second layer, i.e. tweet content features, is partially available in the API, and more features can be created using complex algorithms, such as NLP features. The third layer of features is not directly available in the random feed of tweets. These features must be generated using various methods of data collection, complex algorithms and different assumptions about the structure of the network. Recent studies have used various combinations of features from all three layers. However, those methods are not reproducible because user information cannot be shared publicly.

Fig. 2 Twitter API and availability of Features

User profiles are the most significant part of user behavior analysis, and they are easily available with every tweet coming from the random feed. Zubiaga et al. [48] found that the most common method of data collection from Twitter is the Twitter streaming API. Using the limited features available in the Twitter API can be one way to produce domain-independent, language-independent and general-purpose analysis on very large datasets. Recent studies [24, 48] have found that concerns about user privacy and the restrictions imposed by social media companies on the distribution and sharing of datasets make it very difficult to reproduce the same dataset for social media analysis.

A study [40] on the comparison of Twitter datasets and Twitter archives suggested that freely available archives should be used as an alternative way to reproduce and distribute datasets. The available archives are collections of the live feed of random tweets captured using the Twitter API. Each tweet contains all data fields available in the API as a JSON document. The significance of using archives is that they contain the full user profile along with tweet content features.

Significance of proposed work


  • To provide a baseline pattern of retweet prediction (using 100 million random tweets) for domain-independent data feed with a minimum feature set and low computation requirement.

  • To propose a method for user behavior research that is reproducible and scalable, and that uses a public dataset without violating the terms and conditions of the Twitter API.

  • To reduce the complexity of social media analysis for big data streams using basic numerical features.

  • To predict retweets for every random user, irrespective of whether the user is a normal user or an influencer/celebrity.


  • The dataset contains a random feed without any specific domain, topic, or other conditions.

  • The proposed feature set is created from features available in Twitter streaming API only.

  • The dataset, containing full user profiles, is freely available for research.

  • The feature set includes only numerical values for fast processing and to reduce the computational complexity of text features.


  • The user profile features performed better than tweet content features for retweet prediction.

  • The basic numerical features are very useful for real time user behavior analysis.

  • The proposed feature set requires no preprocessing, which makes it fast and scalable for processing big data streams.

  • The proposed feature set has shown promising results for regression and classification algorithms.

  • The proposed work is able to predict for every user profile, whether the user is influential or normal.

The rest of the article is organized as follows. The related work on retweet prediction is given in section 2. Section 3 describes the methodology of the study. The evaluation of the proposed work using machine learning algorithms is presented in section 4. Section 5 comprises the conclusions and the future scope of this study.

Related works

To understand user behavior, one interesting research question is why a user shares only a few tweets within the network and not all of them. A probable reason is information overload: it is practically impossible for a user to keep sharing every incoming tweet. Hemsley [25] found that approximately 47% of tweets did not get retweets [14]. This presents an opportunity to study and analyze various factors of user actions to predict information sharing behavior.

Recent studies on information sharing proposed various methods to answer these questions. The studies focused on the content of tweets used sentiment analysis, location-based features, NLP techniques, use of hashtags (#), cashtags ($), URLs, and various text-based statistical features [10, 22, 26, 45]. The text-based approaches demand heavy computational resources and also in some cases all past tweets of the user [10, 23, 27, 43, 47]. The tradeoff between accuracy and computational resources is the bottleneck to scale up for big data analysis and real-time analysis of live data streams.

The graph-based approaches are commonly limited to well-defined network boundaries and some static assumptions about the growth of the network [8]. In reality, replicating these studies is a very big computational challenge, and it is also very difficult to produce the same accuracy every time because the network structure keeps evolving.

The retweet cascade techniques need data for the first k retweets or the first 5–10 min window of temporal features for retweet prediction. The problem with this method is that the time stamp and user profile of each retweeter are needed to create a retweet cascade for every single tweet. These approaches are not useful for live feed data, because it is not possible to monitor every single tweet for its upcoming retweets before starting to predict [14, 18, 28, 31, 46, 47].

Retweet prediction is a very popular way of understanding the dynamics of information sharing on Twitter. In recent years, various combinations of features have been proposed for more accurate retweet prediction. The features range from simple statistical features to more complex features including language-specific NLP features, network structure and centrality-based features, temporal features consisting of first n retweets, etc. There are three main questions to understand information sharing on Twitter. The first question is which tweet will get retweets and why? The second question is, what is the significance of network structure and position of a user in the network for successful information diffusion? The third question is which user will retweet a tweet and why? To answer these questions, information required includes information about tweet content, network structure and user profiles of the author of the source tweet, and user profiles of users who will retweet it further.

Hemsley [25] used network structure features to predict the extent of information sharing for political messages and found that users with a medium-size network are more successful in spreading political information than influential users with a large network. Dinh & Parulian [15] used a cascade model for retweets, quote tweets and reply tweets in COVID-related tweets. They found that the average cascade length is 4 h for retweets, 3 days for quote tweets and 2 days for reply tweets. This pattern indicates that the active actions of users, in the form of quotes and replies, have more impact than the passive action of retweeting. Chen et al. [10] studied information sharing in the domain of disaster-related tweets using NLP and network features and found that tweets with neutral and positive sentiment had a larger reach than negative information. The opposite holds for political messages. Interestingly, they also found that if negative information gets a few retweets, it then gets more responses than positive posts. Panic and worries about the disaster lead users to share negative information more rapidly.

For handling big data streams, recent studies have proposed some very promising solutions. Murshed et al. [34] have proposed a model to calculate the overall accuracy of a Twitter dataset using three different methods. Atish’s measures outperformed other methods. They found that several language issues related to spelling, grammar and the unstructured style of writing make it very challenging to achieve a higher level of accuracy. Singh et al. [42] have proposed a framework for processing big data using a machine learning approach. The proposed framework showcased fast processing using distributed computing and the ability to scale the performance of a machine learning algorithm. The clustering of an incoming data stream is very difficult for standard machine learning algorithms. Arpaci et al. [5] have proposed evolutionary clustering for Twitter streams on COVID-related tweets. They used 43 M+ tweets as a dataset. Duan et al. [16] proposed an algorithm, SELM (Spark Extreme Learning Machine), for multi-classification of big data using an Apache Spark cluster. The proposed algorithm performed better and achieved a higher speedup than traditional ELM (Extreme Learning Machine) algorithms.

Information sharing can be analyzed from three different points of view. The first view [10, 14, 20, 26] is to predict whether a tweet will get a retweet or not. The second view [35, 36, 38, 44] asks why the tweets of some users get more retweets than other users’ tweets. The third view [18, 31, 46, 47] is to predict which user will retweet a post and why. To answer these questions, many recent studies have proposed a large number of new features and claimed better results. However, every study is unique in terms of its dataset, domain, set of assumptions, manually coded features, and nature of findings. These studies are therefore hard to replicate for domain-independent, real-time analysis with a standard feature set.

A brief summary of related work categorized by feature set used is given in Table 1.

Table 1 Brief summary of related work

Challenges for retweet prediction in real time big data analysis

Based on the literature review, the following issues are identified:

  • NLP-based approaches need language-specific libraries and are very hard to scale for language-independent analysis.

  • Network-based approaches need a huge amount of information about the social circle of each user, which is not feasible for a real-time random data feed.

  • Manually coded features do not support real time analysis of big data streams.

  • User data is not available from recent studies for performance comparison.

The new features proposed in recent studies are given with the description and whether these features can be extracted using the free Twitter API service. The tweet content features are given in Table 2 and Table 3 shows the features based on the Author profile.

Table 2 List of Features based on Tweet Content
Table 3 List of Features based on User (Author) Profile


Based on the challenges of retweet prediction for big data streams of random tweets, the authors propose a simple, fast and scalable machine learning approach using simple numeric features available in the Twitter API. The categories and list of features are shown in Fig. 3. The categorization of features is based on the information contained in a feature. The tweet content features have information about the tweet text and the count of user responses. The user profile features contain information about the author of the tweet. This includes information about the user's social circle and the user's past actions/activities since the account was created.

Fig. 3 Features used in the study

To understand the active and passive participation of a user, the authors propose a new feature called “Author total activity”. This feature is defined as the sum of all tweets posted by a user (active action) and the total tweets liked by a user (passive action). For a user, the total tweets posted and the total activity show very large values for old accounts and small values for new accounts. Therefore, two further features are introduced to calculate per-year values by dividing these quantities by the user account age counted in years.

$$ Author\ Total\ Activity= Author\ Tweets\ Count+ Author\ Favorites\ Count $$
$$ Author\ Tweets\ per\ year=\frac{Author\ Tweets\ Count}{Account\ Age} $$
$$ Author\ Total\ Activity\ per\ year=\frac{Author\ Tweets\ Count+ Author\ Favorites\ Count}{Account\ Age} $$
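The three definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes a user object shaped like the Twitter API v1.1 user JSON (fields `statuses_count`, `favourites_count`, `created_at`), measures account age in fractional years relative to a caller-supplied reference time, and floors the age at one day to avoid division by zero. The function name `author_features` and the one-day floor are assumptions made here.

```python
from datetime import datetime, timezone

# Timestamp layout used by v1.1-style tweet/user JSON,
# e.g. "Mon Aug 01 00:00:00 +0000 2016".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def author_features(user, now):
    """Compute the three proposed features from one user-profile dict.

    `now` is a timezone-aware reference time (for example the tweet's
    own timestamp); which reference to use is an assumption.
    """
    created = datetime.strptime(user["created_at"], TWITTER_TIME_FORMAT)
    # Assumption: floor the age at one day so brand-new accounts
    # do not cause a division by zero.
    age_seconds = max((now - created).total_seconds(), SECONDS_PER_DAY)
    age_years = age_seconds / SECONDS_PER_YEAR

    tweets = user["statuses_count"]        # active actions
    favorites = user["favourites_count"]   # passive actions
    total_activity = tweets + favorites
    return {
        "author_total_activity": total_activity,
        "author_tweets_per_year": tweets / age_years,
        "author_total_activity_per_year": total_activity / age_years,
    }


example_user = {
    "created_at": "Mon Aug 01 00:00:00 +0000 2016",
    "statuses_count": 2000,
    "favourites_count": 1000,
}
features = author_features(example_user, datetime(2018, 8, 1, tzinfo=timezone.utc))
```

Dividing by account age puts a long-lived account and a recent one on the same scale, which is the motivation given in the text for the per-year features.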

The methodology is explained step by step in Fig. 4. The first requirement is to collect tweets from random feed of Twitter API. Then for each tweet, extract all features available and categorize them into two categories. After that, select only numerical features and compute new proposed features.

Fig. 4 Proposed Methodology for Retweet Prediction

The proposed work is an attempt to predict retweets with the help of information available from a single tweet post without any prior information about the user, network structure, temporal features, and historical tweets. For each random tweet there are following questions for retweet prediction:

  • RQ1: How to predict whether a tweet will be retweeted or not?

  • RQ2: How to estimate the exact number of retweets a tweet will get?

  • RQ3: How to categorize tweets into different classes based on estimated ranges of retweet count?

The machine learning algorithms used in this study are regression algorithms and classification algorithms, as shown in Fig. 4.

Algorithm for the generation of feature sets from the Twitter data stream (figure a).
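The steps described in the methodology (read each tweet, split its fields into tweet content and author profile categories, keep only numerical values, then add a derived feature) might be sketched as follows. This is a hedged illustration rather than the authors' algorithm: it assumes one v1.1-style JSON document per line, and the particular fields chosen (`entities.hashtags`, `followers_count`, and so on) and the function names are assumptions, since the exact feature list of Fig. 3 is not reproduced here.

```python
import json
from numbers import Number

# Assumed numeric author-profile fields from a v1.1-style user object.
AUTHOR_FIELDS = (
    "followers_count", "friends_count", "listed_count",
    "statuses_count", "favourites_count",
)


def numeric_only(mapping):
    """Keep only numeric values (booleans and missing values are dropped)."""
    return {
        key: value for key, value in mapping.items()
        if isinstance(value, Number) and not isinstance(value, bool)
    }


def feature_sets(lines):
    """Yield (tweet_features, author_features, combined) per tweet line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "user" not in record:
            # Records without a user object (e.g. delete notices) are skipped.
            continue
        entities = record.get("entities", {})
        tweet_features = {
            "char_count": len(record.get("text", "")),
            "hashtag_count": len(entities.get("hashtags", [])),
            "url_count": len(entities.get("urls", [])),
            "mention_count": len(entities.get("user_mentions", [])),
        }
        user = record["user"]
        author = numeric_only({name: user.get(name) for name in AUTHOR_FIELDS})
        # Derived feature: tweets posted plus tweets marked as favorite.
        author["author_total_activity"] = (
            author.get("statuses_count", 0) + author.get("favourites_count", 0)
        )
        yield tweet_features, author, {**tweet_features, **author}
```

Because the function is a generator over lines, it can be fed from a file, a decompressed archive, or a live stream without holding the whole dataset in memory.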

Experimental evaluation and results

Dataset: The dataset of 100 million random tweets is created from the online Twitter archive of August 2018 [39, 40].

The description of the dataset used in the study is given in Table 4. The skewness and kurtosis along with other statistical metrics will help to reproduce this dataset and will also help in comparing any other dataset with similar properties. The maximum value for “Tweet char count” and “Tweet emojis count” is very large because Twitter supports Unicode format for emojis in which single emojis can be a combination of multiple characters.

Table 4 Description of 100 Million random Tweets Dataset created from Twitter Archives

Experimental setup (Fig. 5)

The Twitter data collected using the streaming API is available online as archives. The Twitter archives are in a compressed file format. These compressed files are a collection of JSON files that contain the actual raw data as received from the streaming API. The JSON file format is a very good option for unstructured text data of variable length. The size of a tweet object can vary depending on the number of fields. For example, the tweet object of a retweet contains information about both the tweet author and the retweeter, whereas an original tweet object has only the tweet author information. NoSQL databases are used for handling variable-length documents with a large number of missing data fields. The MongoDB NoSQL database is used in this study. The distributed computing on 100 million tweets for big data analysis is done on an 8-node Apache Spark cluster where each node had 16 GB RAM and a 4-core Intel i5 CPU. The programming is done in Python using the PySpark interface of Apache Spark. Jupyter Notebook is used as the development environment.

Fig. 5 Schematic representation of Experimental setup used for the study

Evaluation metrics

The evaluation metrics used in the study are given in Table 5.

Table 5 List of Evaluation metrics
$$ Precision=\frac{TP\_ retweet\_ count}{TP\_ retweet\_ count+ FP\_ retweet\_ count} $$
$$ Recall=\frac{TP\_ retweet\_ count}{TP\_ retweet\_ count+ FN\_ retweet\_ count} $$
$$ F1\ Score=\frac{2\ast Precision\_ retweet\_ count\ast Recall\_ retweet\_ count}{\left( Precision\_ retweet\_ count+ Recall\_ retweet\_ count\right)} $$
$$ Accuracy=\frac{TP\_ retweet\_ count+ TN\_ retweet\_ count}{TP\_ retweet\_ count+ FP\_ retweet\_ count+ FN\_ retweet\_ count+ TN\_ retweet\_ count} $$

Where TP: True Positive, FP: False Positive, FN: False Negative, TN: True Negative
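As a worked illustration of the four formulas above, the following sketch computes them from confusion-matrix counts. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made here, not part of the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Example: 40 retweeted tweets found, 10 false alarms, 10 missed, 40 correct rejections.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```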

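$$ Log\ Loss=\frac{-1}{T}{\sum}_{i=1}^T\left[{rc}_i\log \left(p\left({rc}_i\right)\right)+\left(1-{rc}_i\right)\log \left(1-p\left({rc}_i\right)\right)\right] $$

Where T: Number of Tweets, \( {rc}_i \): observed retweet label of tweet i (1 if retweeted, 0 otherwise), \( p\left({rc}_i\right) \): predicted probability that tweet i is retweeted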
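A minimal sketch of the log loss for binary retweet labels is given below. It assumes labels of 0 or 1 and predicted probabilities of the positive class; the clipping constant `eps` is an assumption added to keep the logarithm finite and is not taken from the paper.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Average negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for label, prob in zip(labels, probabilities):
        # Clip so that log(0) is never evaluated.
        prob = min(max(prob, eps), 1 - eps)
        total += label * math.log(prob) + (1 - label) * math.log(1 - prob)
    return -total / len(labels)


# An uninformative model that always predicts 0.5 scores ln(2).
baseline = log_loss([1, 0], [0.5, 0.5])
```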

$$ AUC={\int}_i^jf(z). dz $$

Where i, j are limits of area, f(z) function of the curve

$$ {R}^2=1-\frac{\sum {\left({rc}_i-{\hat{rc}}_i\right)}^2}{\sum {\left({rc}_i-\overline{rc}\right)}^2} $$
$$ Mean\ Square\ Error=\frac{\sum_{i=1}^T{\left({rc}_i-{\hat{rc}}_i\right)}^2}{T} $$
$$ Root\ Mean\ Square\ Error=\sqrt{\frac{\sum_{i=1}^T{\left({rc}_i-{\hat{rc}}_i\right)}^2}{T}} $$
$$ MedAE\ \left( rc,\hat{rc}\right)= median\left(\left|{rc}_1-{\hat{rc}}_1\right|,\dots, \left|{rc}_T-{\hat{rc}}_T\right|\right) $$

Where \( {rc}_i \): observed value, \( {\hat{rc}}_i \): predicted value, \( \overline{rc} \): mean of all observed values

$$ MAE=\frac{1}{T}\ {\sum}_{i=1}^T\left|{rcp}_i-{rct}_i\right| $$

Where \( {rcp}_i \): retweet count predicted value, \( {rct}_i \): retweet count true value.
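The regression metrics defined above can be computed together from paired lists of observed and predicted retweet counts. The sketch below uses only the standard library; the function name is hypothetical, and it assumes the observed values are not all identical (otherwise the denominator of R-squared is zero).

```python
import math
from statistics import median


def regression_metrics(observed, predicted):
    """R-squared, MSE, RMSE, MAE and MedAE for paired value lists."""
    count = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mean_observed = sum(observed) / count
    ss_residual = sum(e * e for e in errors)
    # Assumption: observed values vary, so ss_total is non-zero.
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    mse = ss_residual / count
    return {
        "r2": 1 - ss_residual / ss_total,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / count,
        "medae": median(abs(e) for e in errors),
    }


scores = regression_metrics([0, 2, 4, 10], [1, 2, 3, 6])
```

In this small example one large miss (10 predicted as 6) dominates MSE and RMSE, while the median absolute error stays at 1, which is why reporting MedAE alongside RMSE is informative for heavy-tailed retweet counts.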

Performance evaluation

To answer the three research questions, three feature sets were tested. The first set consists of only tweet content-based features, the second set consists of only author profile-based features and the third set is a proposed combination of both sets. The performance of each feature set is compared for each algorithm.

RQ1: Whether a tweet will be retweeted or not?

RQ1 is a binary choice question. The reason for choosing binary labels is that in a random sample of tweets, 45% to 50% of tweets do not get any retweet [25]. The binary label helps to categorize tweets into two classes, which reduces the total number of tweets that need further analysis to predict the number of retweets a tweet can get. Two algorithms have been used for this task: logistic regression and logistic model trees. The results are given in Fig. 6, Fig. 7, Tables 6 and 7. All three feature sets were able to predict with very high accuracy. A small improvement is visible in the result values, moving from tweet content features to author features to combined features. The answer to the first research question is yes: it is possible to predict accurately whether a tweet will get a retweet or not.

Fig. 6 AUC and PR Curve of Logistic Regression for RQ1. (a) LR: Tweet Content features. (b) LR: Author Profile features. (c) LR: Proposed Combined features

Fig. 7 AUC and PR Curve of Logistic Model Trees for RQ1. (a) LMT: Tweet Content features. (b) LMT: Author Profile features. (c) LMT: Proposed Combined features

Table 6 Performance Comparison part 1 for RQ1
Table 7 Performance Comparison part 2 for RQ1

RQ 2: Predict the accurate retweet count for a tweet.

Regression analysis is performed to determine the retweet count for a random tweet. The results from the regression algorithms are given in Fig. 8 and Table 8. The results indicate that author features performed better than tweet features, and combined features gave the best performance of the three. The R-squared and RMSE values of every regression algorithm are plotted in Fig. 8. The Random Forest and Decision Tree algorithms performed best among all. Even so, all the algorithms produced poor results, which indicates that these features are not a good choice for answering this research question. Hence, the answer to the second research question is that prediction of the exact number of retweets is not possible with these features. They can be combined with other features in future studies for exploratory analysis.

Fig. 8 Regression Analysis for RQ 2. (a) R-Squared value comparison. (b) RMSE value comparison

Table 8 Performance Comparison of Regression Analysis for RQ2

RQ3: Categorize tweets into multi-label classes.

To classify the tweets into various classes based on ranges of retweet count, different classification algorithms were used. The performance of the three feature sets was tested on different numbers of bins. The binning criteria are given in Table 9. The results are given in Tables 10, 11, 12, 13 and 14.

Table 9 The binning criteria for classification
Table 10 Performance Metrics for Decision Tree Classification
Table 11 Performance Metrics for Random Forest Classification
Table 12 Performance Metrics for Gradient Boosted Tree Classification
Table 13 Performance Metrics for SVM Classification
Table 14 Performance Metrics for KNN Classification
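The idea of turning a retweet count into a class label by binning can be sketched as follows. The bin edges used here are purely hypothetical placeholders; the actual criteria used in the study are those listed in Table 9, which is not reproduced in this text.

```python
from bisect import bisect_right

# Hypothetical edges, for illustration only:
# class 0 -> 0 retweets, class 1 -> 1..9, class 2 -> 10..99, class 3 -> 100 or more.
EXAMPLE_EDGES = [1, 10, 100]


def retweet_class(retweet_count, edges=EXAMPLE_EDGES):
    """Return the index of the bin that contains retweet_count."""
    # bisect_right counts how many edges are <= retweet_count,
    # which is exactly the class index for sorted, lower-inclusive edges.
    return bisect_right(edges, retweet_count)


labels = [retweet_class(count) for count in (0, 1, 9, 10, 99, 100, 5000)]
```

Adding more edges produces finer classes, which is the setting in which the reported accuracy declines.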

The values of precision, recall, F1-score, and Accuracy measure are plotted. The performances of all three feature sets in terms of accuracy measure are above 0.8 score for number of classes less than 4. After that as the number of bins/classes increases, a steady decline in performances is visible. At the highest values of bins (bins = 7), tweet features performed less than 0.6 accuracy score, whereas, author features and proposed combined features performed more than 0.6 accuracy score.

The F1-score is plotted for all three classification algorithms and for all values of bins. The results are shown in Fig. 9. The best performing algorithm is Random Forest, with an F1-score always greater than 0.7 for author features and combined features. After that, Gradient Boosted Tree performed better than the Decision Tree classifier.

Fig. 9 Performance comparison of Classification algorithms for RQ3. (a) Decision Tree. (b) Gradient Boosted Tree. (c) Random Forest. (d) SVM. (e) KNN

An interesting observation is that as the number of classes increases, author features perform very close to combined features. This suggests that for a large number of classes/bins, the author features can be used instead of the combined features, which reduces the total number of features required and also the complexity of the system. The classification algorithms have shown promising results. The answer to the third research question is that it is possible to categorize tweets into different classes. However, a tradeoff between accuracy and the number of classes should be considered, as shown in eq. 1.

$$ Accuracy\propto \frac{1}{Number\ of\ Bins(Classes)} $$

Comparison with other works

The comparison of the proposed work with other works is given in Table 15. The highlight of the proposed work is that the proposed feature set has low implementation complexity.

Table 15 Comparison with recent works on information sharing techniques

Conclusions and future work

In this paper, an attempt is made to understand the point of view of a user as an information processing node and the role of user profiles on Twitter in predicting retweets. The criteria of using only the Twitter API as the data source and a small number of features provide a distinctive way of looking at the problem of retweet prediction. The Twitter API is the most common method for data collection from Twitter, which makes it a natural choice for creating reproducible research work.

Manually coded features, and new features created using complex algorithms, reduce the chances of scaling up and of replication in other scenarios. A recent study [10] found that positive sentiment results in more retweets during natural disasters. However, previous studies [29] have found that negative sentiment increased retweets in an election campaign. In two different domains, the same feature resulted in different outcomes. This illustrates that some complex features are not well suited to domain-independent, very large scale, fast data processing.

The contribution of this paper is the effort of reducing complexity and the computational requirement for big data analysis of social media data. The ability to use only numerical features is a very fast, scalable, and feasible solution. Two out of three types of features related to retweet prediction are available in Twitter API, from which author features proved to be more significant than tweet content features. The combination of both features produced the best results.

The three new features, “author total activity”, “author total activity per year” and “author tweets per year”, are easy to compute and useful in capturing the active and passive participation of a user. Spikes in the total activity value are scaled down by dividing by the number of years of user account age. The same method is used to scale down spikes in the count of tweets posted by a user. This averaging of the total activity count and the total tweet count over the number of years of account age is very useful for those users who are not regularly active. It provides an ability to predict for a random user who is not an influential user or a celebrity. Most state-of-the-art research works give more importance to influential users. In real-time data analysis, every tweet is important and every user profile is useful for accurate prediction of retweets. The proposed features provide better results for every type of user. These features also provide an important insight for categorizing user accounts as trustworthy or less trustworthy. This will form a basis for distinguishing genuine users from non-genuine looking accounts.

The proposed method of retweet prediction can easily predict whether a tweet will be retweeted or not. The ability to predict the exact number of retweets is not achievable with these assumptions and feature sets, but it can be used with some other features to reduce the margin of error. The classification of tweets based on retweet count is possible, however, it is difficult to predict accurately with a large number of classes. The fine grain classes come with the drawback of poor accuracy and the small number of classes results in high accuracy but a very large range as one label which is practically not useful for multiclass classification.

In future work, the proposed feature sets will be applied for the categorization of user accounts based on activities, user account role as hub or crowd [17], and the impact of information overload on social media users. For categorization of fake and genuine accounts [32] based on their user profile features, the proposed three features will be used. The proposed profile features will also be used for opinion mining, sentiment analysis and fake account detection.


  1. Adewole KS, Anuar NB, Kamsin A, Sangaiah AK (2019) SMSAD: a framework for spam message and spam account detection. Multimed Tools Appl 78(4):3925–3960

  2. Aggarwal A, Rajadesingan A, Kumaraguru P (2012) PhishAri: automatic realtime phishing detection on twitter. In: 2012 eCrime researchers summit. IEEE, pp 1–12

  3. Alsaleh M, Alarifi A, Al-Salman AM, Alfayez M, Almuhaysin A (2014) Tsd: detecting sybil accounts in twitter. In: 13th international conference on machine learning and applications. IEEE, pp 463–469

  4. Antonakaki D, Fragopoulou P, Ioannidis S (2021) A survey of twitter research: data model, graph structure, sentiment analysis and attacks. Expert Syst Appl 164:114006

  5. Arpaci I, Alshehabi S, Al-Emran M, Khasawneh M, Mahariq I, Abdeljawad T, Hassanien AE (2020) Analysis of twitter data using evolutionary clustering during the COVID-19 pandemic. Comput Mater Contin 65(1):193–203

  6. Bhowmick AK, Gueuning M, Delvenne JC, Lambiotte R, Mitra B (2019) Temporal sequence of retweets help to detect influential nodes in social networks. IEEE Trans Comput Soc Syst 6(3):441–455

  7. Chen L, Deng H (2020) Predicting user retweeting behavior in social networks with a novel ensemble learning approach. IEEE Access 8:148250–148263

  8. Chen G, Kong Q, Xu N, Mao W (2019) NPP: a neural popularity prediction model for social media content. Neurocomputing 333:221–230

  9. Chen S, Li S, Chen S, Yuan X (2019) R-map: a map metaphor for visualizing information reposting process in social media. IEEE Trans Vis Comput Graph 26(1):1204–1214

  10. Chen S, Mao J, Li G, Ma C, Cao Y (2020) Uncovering sentiment and retweet patterns of disaster-related tweets from a spatiotemporal perspective–a case study of hurricane Harvey. Telematics Inform 47:101326

  11. Chu Z, Gianvecchio S, Wang H, Jajodia S (2012) Detecting automation of twitter accounts: are you a human, bot, or cyborg? IEEE Trans Dependable Secure Comput 9(6):811–824

  12. Chung W, Toraman C, Huang Y, Vora M, Liu J (2019) A deep learning approach to modeling temporal social networks on Reddit. In: 2019 IEEE international conference on intelligence and security informatics (ISI). IEEE, pp 68–73

  13. Daga I, Gupta A, Vardhan R, Mukherjee P (2020) Prediction of likes and retweets using text information retrieval. Procedia Comput Sci 168:123–128

  14. Dinh L, Parulian N (2020) COVID-19 pandemic and information diffusion analysis on twitter. Proc Assoc Inf Sci Technol 57(1):e252

  15. Duan M, Li K, Liao X, Li K (2017) A parallel multiclassification algorithm for big data using an extreme learning machine. IEEE Trans Neural Netw Learn Syst 29(6):2337–2351

  16. Dutta HS, Dutta VR, Adhikary A, Chakraborty T (2020) HawkesEye: detecting fake retweeters using Hawkes process and topic modeling. IEEE Transactions on Information Forensics and Security 15:2667–2678

  17. Fan C, Jiang Y, Yang Y, Zhang C, Mostafavi A (2020) Crowd or hubs: information diffusion patterns in online social networks in disasters. Int J Disaster Risk Reduct 46:101498

  18. Firdaus SN, Ding C, Sadeghian A (2018) Retweet: a popular information diffusion mechanism–a survey paper. Online Soc Netw Media 6:26–40

  19. Firdaus SN, Ding C, Sadeghian A (2019) Topic specific emotion detection for retweet prediction. Int J Mach Learn Cybern 10(8):2071–2083

  20. Gao X, Zheng Z, Chu Q, Tang S, Chen G, Deng Q (2019) Popularity prediction for single tweet based on heterogeneous bass model. IEEE Trans Knowl Data Eng:1

  21. Hemphill L, Hedstrom ML, Leonard SH (2021) Saving social media data: understanding data management practices among social media researchers and their implications for archives. J Assoc Inf Sci Technol 72(1):97–109

  22. Hemsley J (2019) Followers retweet! The influence of middle-level gatekeepers on the spread of political information on twitter. Policy Internet 11(3):280–304

  23. Jain DK, Kumar A, Sharma V (2020) Tweet recommender model using adaptive neuro-fuzzy inference system. Futur Gener Comput Syst 112:996–1009

  24. Jalali NY, Papatla P (2019) Composing tweets to increase retweets. Int J Res Mark 36(4):647–668

  25. Jung AK, Ross B, Stieglitz S (2020) Caution: rumors ahead—a case study on the debunking of false information on twitter. Big Data Soc 7(2):2053951720980127

  26. Lee S, Kim J (2014) Early filtering of ephemeral malicious accounts on twitter. Comput Commun 54:48–57

  27. Lee J, Xu W (2018) The more attacks, the more retweets: Trump’s and Clinton’s agenda setting on twitter. Public Relat Rev 44(2):201–213

  28. Lymperopoulos IN (2021) RC-tweet: modeling and predicting the popularity of tweets through the dynamics of a capacitor. Expert Syst Appl 163:113785

  29. Miller Z, Dickinson B, Deitrick W, Hu W, Wang AH (2014) Twitter spammer detection using data stream clustering. Inf Sci 260:64–73

  30. Murshed BAH, Al-Ariki HDE, Mallappa S (2020) Semantic analysis techniques using twitter datasets on big data: comparative analysis study. Comput Syst Sci Eng 35(6):495–512

  31. Nesi P, Pantaleo G, Paoli I, Zaza I (2018) Assessing the reTweet proneness of tweets: predictive models for retweeting. Multimed Tools Appl 77(20):26371–26396

  32. PV S, Bhanu S (2020) UbCadet: detection of compromised accounts in twitter based on user behavioural profiling. Multimed Tools Appl 79:1–37

  33. Rousidis D, Koukaras P, Tjortjis C (2020) Social media prediction: a literature review. Multimed Tools Appl 79(9):6279–6311

  34. Safari RM, Rahmani AM, Alizadeh SH (2019) User behavior mining on social media: a systematic literature review. Multimed Tools Appl 78(23):33747–33804

  35. Scott, Jason, and Sketch the Cow. “Archiveteam-Twitter-Stream-2018-08 : Free Download, Borrow, and Streaming.” Internet Archive, Archive Team: The Twitter Stream Grab, 6 Dec. 2012, 01:03:03,

  36. Sequiera R, Lin J (2017) Finally, a downloadable test collection of tweets. In: Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, pp 1225–1228

  37. Shyni CE, Sundar AD, Ebby GSE (2016) Spam profile detection in online social network using statistical approach. Asian J Inf Technol 15(7):1253–1262

  38. Singh SK, Cha J, Kim TW, Park JH (2021) Machine learning based distributed big data analysis framework for next generation web in IoT. Comput Sci Inf Syst 18(2):597–618

  39. Son J, Lee HK, Jin S, Lee J (2019) Content features of tweets for effective communication during disasters: a media synchronicity theory perspective. Int J Inf Manag 45:56–68

  40. Son J, Lee J, Oh O, Lee HK, Woo J (2020) Using a heuristic-systematic model to assess the twitter user profile’s impact on disaster tweet credibility. Int J Inf Manag 54:102176

  41. Tardelli S, Avvenuti M, Tesconi M, Cresci S (2020) Characterizing social bots spreading financial disinformation. In: International conference on human-computer interaction. Springer, Cham, pp 376–392

  42. Tian Y, Fan R, Ding X, Zhang X, Gan T (2020) Predicting rumor retweeting behavior of social media users in public emergencies. IEEE Access 8:87121–87132

  43. Wang S, Li C, Wang Z, Chen H, Zheng K (2020) BPF++: a unified factorization model for predicting retweet behaviors. Inf Sci 515:218–232

  44. Yang C, Harkreader R, Gu G (2013) Empirical evaluation and new design for fighting evolving twitter spammers. IEEE Trans Inf Forensics Secur 8(8):1280–1293

  45. Zheng X, Zeng Z, Chen Z, Yu Y, Rong C (2015) Detecting spammers on social networks. Neurocomputing 159:27–34

  46. Zhou F, Xu X, Trajcevski G, Zhang K (2021) A survey of information cascade analysis: models, predictions, and recent advances. ACM Comput Surv 54(2):1–36

  47. Zola P, Cortez P, Carpita M (2019) Twitter user geolocation using web country noun searches. Decis Support Syst 120:50–59

  48. Zubiaga A (2018) A longitudinal assessment of the persistence of twitter datasets. J Assoc Inf Sci Technol 69(8):974–984

  49. Zubiaga A, Aker A, Bontcheva K, Liakata M, Procter R (2018) Detection and resolution of rumours in social media: a survey. ACM Comput Surv 51(2):1–36



The author wishes to thank the Design and Innovation Center, Chandigarh, UIET, Panjab University for providing computational resources for big data analysis on the Apache Spark cluster.

Author information

Corresponding author

Correspondence to Vishal Gupta.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose. The authors have no conflicts of interest to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript. The authors have no financial or proprietary interests in any material discussed in this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sharma, S., Gupta, V. Role of twitter user profile features in retweet prediction for big data streams. Multimed Tools Appl (2022).


  • Twitter
  • Social media analysis
  • Retweet prediction
  • User behavior
  • User profiling
  • Big data analysis