1 Introduction

Today we live in a world where people actively participate in online social platforms. In one form or another, online social networks are part of our daily lives. Social media platforms provide different types of services, ranging from sharing personal views, collaborating with others, spreading information of interest, and exploring new ideas to discussing real-life events and participating in evolving communities. Every social media network has a unique purpose: for example, Facebook is primarily used to connect with family and friends, LinkedIn to connect with one's professional circle, Instagram to share multimedia content, Pinterest to explore interesting pins from others, and Tumblr to find and follow blogs across various categories [4, 49].

Over the last ten years, research in social media analysis has grown across a wide range of topics: ROI for organizations, prediction of real-life changes influenced by social media, descriptive analysis of real-life events as discussed on online platforms [10], viral marketing, social issues, health issues, natural disasters, emergencies, online surveys, countering fake information, detecting cyberbullying and abusive language, e-learning, online monitoring, and more.

In the research area of social media analysis, Twitter is a very popular choice among researchers because of the simple method of accessing its data through an API. The raw feed from the Twitter API is rich in information, in terms of both tweet content features and user profile features. The real potential of the API is that its data can be used both for real-time analysis and for batch processing of huge volumes of tweets [4, 37].

1.1 Motivation

Noteworthy efforts have been made in the past few years to study the activities of online users and to understand their behavioral patterns across various research domains [29, 35, 36, 38, 44]. Online activities make every user distinguishable from other users, and over time these differences become visible as strong patterns. The behavior patterns, or signature style, of a user are very useful in authentication, identification, and access control applications [19].

User-centric approach

The proposed work is an attempt to predict retweets from the point of view of a single user. A Twitter user is the only entity that takes action using free will. As shown in Fig. 1, a Twitter user receives a huge amount of information from various sources. This information overload has a deep impact on user actions: a user cannot consume the sheer amount of information at the same pace as it arrives. This leads to a situation where a user may take either an active or a passive action on the current piece of information. All actions where a user generates new content are active actions, and all actions where a user does not generate new content are passive actions. Retweeting falls under the category of passive actions because, without adding any new information, a user lets the existing information flow towards its followers in the network.

Fig. 1
figure 1

Twitter user as information processing node

These actions form the basis of user behavior, and all of them are recorded in the user profile. For example, a user profile contains the number of tweets posted by the user (active action) and the number of tweets marked as favorites (passive action). The number of tweets retweeted by a user is not recorded in the profile; hence, retweet prediction becomes a research problem of analyzing all the other actions performed by the user since the account was created.

User data and Twitter datasets

The problem of reproducing a Twitter dataset is a major issue for user behavior analysis. Public datasets release only tweet content features, or sometimes just TweetIDs. Hydrating a dataset from TweetIDs after four years results in a loss of about 30% of the dataset [48]. Moreover, the terms and conditions of the Twitter API do not allow fetching user profiles from TweetIDs. The proposed work is an attempt to provide an alternative way to handle this problem by using public Twitter archives [24, 40].

Fig. 2 shows three layers of features that can be used for retweet prediction. The first layer consists of user features, which are available with every tweet collected through the API; the numerical features can be used as-is, and some features can be computed with basic mathematical operations. The second layer, tweet content features, is partially available in the API, and more features can be created using complex algorithms such as NLP pipelines. The third layer of features is not directly available in the random feed of tweets; these features must be generated using additional methods of data collection, complex algorithms, and various assumptions about the network structure. Recent studies have used various combinations of features from all three layers. However, those methods are not reproducible because user information cannot be shared publicly.

Fig. 2
figure 2

Twitter API and availability of Features

User profiles are the most significant part of user behavior analysis, and they are easily available with every tweet coming from the random feed. Zubiaga et al. [48] found that the most common method of data collection from Twitter is the streaming API. Using only the limited features available in the Twitter API is one way to produce domain-independent, language-independent, general-purpose analysis on very large datasets. Recent studies [24, 48] have found that concerns over user privacy and the restrictions imposed by social media companies on distributing and sharing datasets make it very difficult to reproduce the same dataset for social media analysis.

A study [40] comparing Twitter datasets and Twitter archives suggested that freely available archives should be used as an alternative way to reproduce and distribute datasets. The available archives are collections of the live feed of random tweets captured using the Twitter API. Each tweet contains all the data fields available in the API as a JSON document. The significance of using archives is that they contain the full user profile along with the tweet content features.

1.2 Significance of proposed work

Objectives:

  • To provide a baseline pattern of retweet prediction (using 100 million random tweets) for domain-independent data feed with a minimum feature set and low computation requirement.

  • To propose a method for user behavior research that is reproducible, scalable, and based on a public dataset, without violating the terms and conditions of the Twitter API.

  • To reduce the complexity of social media analysis for big data streams using basic numerical features.

  • To predict retweets for every random user, irrespective of whether the user is a normal user or an influencer/celebrity.

Conditions:

  • The dataset contains a random feed without any specific domain, topic, or other conditions.

  • The proposed feature set is created from features available in Twitter streaming API only.

  • The dataset, containing full user profiles, is freely available for research.

  • The feature set includes only numerical values for fast processing and to reduce the computational complexity of text features.

Outcomes:

  • The user profile features performed better than tweet content features for retweet prediction.

  • The basic numerical features are very useful for real-time user behavior analysis.

  • The proposed feature set requires no preprocessing, which makes it fast and scalable for processing big data streams.

  • The proposed feature set has shown promising results for both regression and classification algorithms.

  • The proposed work is able to predict for every user profile, whether influential or normal.

The rest of the article is organized as follows. Related work on retweet prediction is given in Section 2. Section 3 describes the methodology of the study. The evaluation of the proposed work using machine learning algorithms is presented in Section 4. Section 5 comprises the conclusions and future scope of this study.

2 Related works

To understand user behavior, one interesting research question is why a user shares only a few tweets within the network and not all of them. A probable reason is information overload: it is practically impossible for a user to keep sharing every incoming tweet. Hemsley [25] found that approximately 47% of tweets did not get any retweets [14]. This presents an opportunity to analyze the various factors of user actions in order to predict information sharing behavior.

Recent studies on information sharing have proposed various methods to answer these questions. Studies focused on tweet content have used sentiment analysis, location-based features, NLP techniques, the use of hashtags (#), cashtags ($), URLs, and various text-based statistical features [10, 22, 26, 45]. The text-based approaches demand heavy computational resources and, in some cases, all past tweets of the user [10, 23, 27, 43, 47]. The tradeoff between accuracy and computational resources is the bottleneck when scaling up to big data analysis and real-time analysis of live data streams.

Graph-based approaches are commonly limited to well-defined network boundaries and static assumptions about the growth of the network [8]. In practice, replicating these studies is a very large computational challenge, and it is also very difficult to reproduce the same accuracy every time because the network structure keeps evolving.

Retweet cascade techniques need data for the first k retweets, or a 5–10 minute window of temporal features, for retweet prediction. The problem with this method is that the timestamp and user profile of each retweeter are needed to create a retweet cascade for every single tweet. These approaches are not useful for live feed data, because it is not possible to monitor every single tweet for its upcoming retweets before starting to predict [14, 18, 28–31, 46, 47].

Retweet prediction is a very popular way of understanding the dynamics of information sharing on Twitter. In recent years, various combinations of features have been proposed for more accurate retweet prediction, ranging from simple statistical features to more complex ones, including language-specific NLP features, network structure and centrality-based features, and temporal features built from the first n retweets. There are three main questions for understanding information sharing on Twitter. The first is which tweets will get retweets, and why. The second is what significance the network structure and a user's position in the network have for successful information diffusion. The third is which user will retweet a tweet, and why. Answering these questions requires information about the tweet content, the network structure, the user profile of the author of the source tweet, and the user profiles of the users who will retweet it further.

Hemsley [25] used network structure features to predict the extent of information sharing for political messages and found that users with medium-sized networks are more successful in spreading political information than influential users with large networks. Dinh & Parulian [15] used a cascade model for retweets, quote tweets, and replies on COVID-related tweets. They found that the average cascade length is 4 hours for retweets, 3 days for quote tweets, and 2 days for replies. This pattern indicates that the active actions of users, in the form of quotes and replies, have more impact than the passive action of retweeting. Chen et al. [10] studied information sharing for disaster-related tweets using NLP and network features and found that tweets with neutral and positive sentiment had a larger reach than negative information, which is just the opposite of the finding for political messages. Interestingly, they also found that when a piece of negative information does get a few retweets, it draws more responses than positive posts. Panic and worry about a disaster's impact drive user behavior to share negative information more rapidly.

For handling big data streams, recent studies have proposed some very promising solutions. Murshed et al. [34] proposed a model to calculate the overall accuracy of a Twitter dataset using three different methods, of which Atish's measure outperformed the others. They found that language issues related to spelling, grammar, and the unstructured style of writing make it very challenging to achieve a high level of accuracy. Singh et al. [42] proposed a framework for processing big data with a machine learning approach; the framework showcased fast processing using distributed computing and the ability to scale the performance of machine learning algorithms. Clustering an incoming data stream is very difficult for standard machine learning algorithms, and Arpaci et al. [5] proposed evolutionary clustering of Twitter streams on COVID-related tweets, using more than 43 million tweets as the dataset. Duan et al. [16] proposed SELM (Spark Extreme Learning Machine), an algorithm for multi-class classification of big data on an Apache Spark cluster; it performed better and achieved a higher speedup than traditional ELM (Extreme Learning Machine) algorithms.

Information sharing can be analyzed from three different points of view. The first [10, 14, 20, 26] is to predict whether a tweet will get a retweet or not. The second [35, 36, 38, 44] is why the tweets of some users get more retweets than those of other users. The third [18, 31, 46, 47] is to predict which user will retweet a post, and why. To answer these questions, many recent studies have proposed a large number of new features and claimed better results. However, every study is unique in terms of its dataset, domain, set of assumptions, manually coded features, and nature of findings. Replicating these studies is not suitable for domain-independent, real-time analysis with a standard feature set.

A brief summary of related work categorized by feature set used is given in Table 1.

Table 1 Brief summary of related work

2.1 Challenges for retweet prediction in real time big data analysis

Based on the literature review, the following issues are identified:

  • NLP-based approaches need language-specific libraries and are very hard to scale for language-independent analysis.

  • Network-based approaches need a huge amount of information about the social circle of each user, which is not feasible for a real-time random data feed.

  • Manually coded features do not support real-time analysis of big data streams.

  • User data from recent studies is not available for performance comparison.

The new features proposed in recent studies are listed with a description and an indication of whether they can be extracted using the free Twitter API service. The tweet content features are given in Table 2, and Table 3 shows the features based on the author profile.

Table 2 List of Features based on Tweet Content
Table 3 List of Features based on User (Author) Profile

3 Methodology

Based on the challenges of retweet prediction for big data streams of random tweets, the authors propose a simple, fast, and scalable machine learning approach using the simple numeric features available in the Twitter API. The categories and list of features are shown in Fig. 3. The categorization is based on the information contained in each feature: tweet content features carry information about the tweet text and the counts of user responses, while user profile features carry information about the author of the tweet, including the user's social circle and past actions/activities since the account was created.

Fig. 3
figure 3

Features used in the study

To capture the active and passive participation of a user, the authors propose a new feature, "Author Total Activity", defined as the sum of all tweets posted by a user (active action) and the total tweets liked by the user (passive action). For a given user, the total tweets posted and the total activity take very large values for old accounts and small values for new accounts. Therefore, additional features are introduced that normalize these counts to per-year values by dividing them by the account age in years.

$$ Author\ Total\ Activity= Author\ Tweets\ Count+ Author\ Favorites\ Count $$
(1)
$$ Author\ Tweets\ per\ year=\frac{Author\ Tweets\ Count}{Account\ Age} $$
(2)
$$ Author\ Total\ Activity\ per\ year=\frac{Author\ Tweets\ Count+ Author\ Favorites\ Count}{Account\ Age} $$
(3)
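As a concrete illustration, Eqs. (1)–(3) can be computed directly from the fields of the user object delivered with every tweet. The helper below is a minimal sketch, assuming the standard API field names (`statuses_count`, `favourites_count`, `created_at`) and an account age floored at one day so brand-new accounts do not divide by zero; the example profile is hypothetical.

```python
from datetime import datetime, timezone

def author_activity_features(user, now=None):
    """Compute Eqs. (1)-(3) from a Twitter API user object (illustrative sketch)."""
    now = now or datetime.now(timezone.utc)
    created = datetime.strptime(user["created_at"], "%a %b %d %H:%M:%S %z %Y")
    # account age in years, floored at one day to avoid division by zero
    account_age = max((now - created).days, 1) / 365.25
    total_activity = user["statuses_count"] + user["favourites_count"]   # Eq. (1)
    return {
        "author_total_activity": total_activity,
        "author_tweets_per_year": user["statuses_count"] / account_age,  # Eq. (2)
        "author_total_activity_per_year": total_activity / account_age,  # Eq. (3)
    }

# hypothetical profile: 5,000 tweets and 3,000 likes on a two-year-old account
profile = {"statuses_count": 5000, "favourites_count": 3000,
           "created_at": "Mon Aug 06 10:00:00 +0000 2018"}
feats = author_activity_features(
    profile, now=datetime(2020, 8, 6, 10, 0, tzinfo=timezone.utc))
```

For this profile the total activity is 8,000, and both per-year features come out close to half the raw counts, which is exactly the intended scaling-down for older accounts.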

The methodology is explained step by step in Fig. 4. The first step is to collect tweets from the random feed of the Twitter API. Then, for each tweet, all available features are extracted and categorized into two categories. After that, only the numerical features are selected and the newly proposed features are computed.

Fig. 4
figure 4

Proposed Methodology for Retweet Prediction

The proposed work attempts to predict retweets using only the information available in a single tweet post, without any prior information about the user, the network structure, temporal features, or historical tweets. For each random tweet, retweet prediction raises the following questions:

  • RQ1: How to predict whether a tweet will be retweeted or not?

  • RQ2: How to estimate the exact number of retweets a tweet will get?

  • RQ3: How to categorize tweets into different classes based on estimated ranges of retweet count?

As shown in Fig. 4, the machine learning algorithms used in this study are regression algorithms and classification algorithms.

Algorithm for the generation of feature sets from the Twitter data stream.

figure a
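Since the algorithm itself appears only as a figure, the following sketch illustrates the intended step under stated assumptions: each incoming tweet object is split into a tweet content feature set and an author profile feature set, keeping only numeric values. The field names follow the Twitter API; the exact grouping and the derived text counts are illustrative assumptions, not the paper's verbatim algorithm.

```python
# Illustrative split of one tweet object into the two feature categories.
TWEET_CONTENT_FIELDS = ("retweet_count", "favorite_count")
AUTHOR_PROFILE_FIELDS = ("followers_count", "friends_count", "listed_count",
                         "statuses_count", "favourites_count")

def extract_feature_sets(tweet):
    text = tweet.get("text", "")
    content = {f: tweet.get(f, 0) for f in TWEET_CONTENT_FIELDS}
    # simple numeric features derived from the tweet text itself
    content["char_count"] = len(text)
    content["word_count"] = len(text.split())
    content["hashtag_count"] = text.count("#")
    author = {f: tweet.get("user", {}).get(f, 0) for f in AUTHOR_PROFILE_FIELDS}
    return content, author

tweet = {"text": "hello #world", "retweet_count": 2,
         "user": {"followers_count": 150, "statuses_count": 40}}
content, author = extract_feature_sets(tweet)
```

Missing fields default to zero, so the same two fixed-length numeric vectors come out of every tweet in the stream, which is what makes the downstream processing scalable.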

4 Experimental evaluation and results

4.1 Dataset

The dataset of 100 million random tweets is created from the online Twitter archive of August 2018 [39, 40].

The description of the dataset used in the study is given in Table 4. The skewness and kurtosis, along with the other statistical metrics, will help to reproduce this dataset and to compare it with any other dataset with similar properties. The maximum values for "Tweet char count" and "Tweet emojis count" are very large because Twitter supports the Unicode format for emojis, in which a single emoji can be a combination of multiple characters.

Table 4 Description of 100 Million random Tweets Dataset created from Twitter Archives

4.2 Experimental setup (Fig. 5)

The Twitter data collected using the streaming API is available online as archives in a compressed file format. These compressed files are collections of JSON files that contain the actual raw data as received from the streaming API. The JSON format is a very good option for unstructured, variable-length text data: the size of a tweet object varies with the number of fields. For example, the tweet object of a retweet contains information about both the tweet author and the retweeter, whereas an original tweet object has only the tweet author's information. NoSQL databases are suited to handling variable-length documents with a large number of missing data fields; the MongoDB NoSQL database is used in this study. The distributed computation over 100 million tweets is done on an 8-node Apache Spark cluster, where each node has 16 GB of RAM and a 4-core Intel i5 CPU. The programming is done in Python using the pyspark interface of Apache Spark, with Jupyter Notebook as the IDE.
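As a minimal, standard-library-only sketch of the ingestion step (the study itself loads the archives into MongoDB and processes them with pyspark on the Spark cluster), a compressed archive file can be read as line-delimited JSON, skipping the non-tweet stream messages such as delete notices:

```python
import gzip
import json

def iter_tweets(path):
    """Yield tweet objects from a gzip-compressed, line-delimited JSON archive."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            # the streaming API interleaves delete notices and limit messages
            # with real tweets; real tweets always carry a "user" object
            if "user" in obj:
                yield obj
```

The generator form keeps memory flat regardless of archive size, which matters when the same loop has to cover 100 million tweets.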

Fig. 5
figure 5

Schematic representation of Experimental setup used for the study

4.3 Evaluation metrics

The evaluation metrics used in the study are given in Table 5.

Table 5 List of Evaluation metrics
$$ Precision=\frac{TP\_ retweet\_ count}{TP\_ retweet\_ count+ FP\_ retweet\_ count} $$
(4)
$$ Recall=\frac{TP\_ retweet\_ count}{TP\_ retweet\_ count+ FN\_ retweet\_ count} $$
(5)
$$ F1\ Score=\frac{2\ast Precision\_ retweet\_ count\ast Recall\_ retweet\_ count}{\left( Precision\_ retweet\_ count+ Recall\_ retweet\_ count\right)} $$
(6)
$$ Accuracy=\frac{TP\_ retweet\_ count+ TN\_ retweet\_ count}{TP\_ retweet\_ count+ FP\_ retweet\_ count+ FN\_ retweet\_ count+ TN\_ retweet\_ count} $$
(7)

Where TP: True Positive, FP: False Positive, FN: False Negative, TN: True Negative

$$ Log\ Loss=-\frac{1}{T}{\sum}_{i=1}^T\left[{rc}_i\cdot \log \left(P\left({rc}_i\right)\right)+\left(1-{rc}_i\right)\cdot \log \left(1-P\left({rc}_i\right)\right)\right] $$
(8)

Where T: number of tweets, rci: observed retweet label, P(rci): predicted probability

$$ AUC={\int}_i^jf(z). dz $$
(9)

Where i, j are limits of area, f(z) function of the curve

$$ {R}^2=1-\frac{\sum {\left({rc}_i-{\hat{rc}}_i\right)}^2}{\sum {\left({rc}_i-\overline{rc}\right)}^2} $$
(10)
$$ Mean\ Square\ Error=\frac{\sum_{i=1}^T{\left({rc}_i-{\hat{rc}}_i\right)}^2}{T} $$
(11)
$$ Root\ Mean\ Square\ Error=\sqrt{\frac{\sum_{i=1}^T{\left({rc}_i-{\hat{rc}}_i\right)}^2}{T}} $$
(12)
$$ MedAE\ \left( rc,\hat{rc}\right)= median\left(\left|{rc}_1-{\hat{rc}}_1\right|,\dots, \left|{rc}_T-{\hat{rc}}_T\right|\right) $$
(13)

Where rci: observed value, \( {\hat{rc}}_i \): predicted value, \( \overline{rc} \): mean of all observed values

$$ MAE=\frac{1}{T}\ {\sum}_{i=1}^T\left|{rcp}_i-{rct}_i\right| $$
(14)

Where rcpi: retweet count predicted value, rcti: retweet count true value.
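For the binary label of RQ1, Eqs. (4)–(7) reduce to the usual confusion-matrix computation. The sketch below works through them on hypothetical label vectors (1 meaning "got at least one retweet"):

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy per Eqs. (4)-(7), for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)                          # Eq. (4)
    recall = tp / (tp + fn)                             # Eq. (5)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (6)
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # Eq. (7)
    return precision, recall, f1, accuracy

# hypothetical predictions for five tweets: TP=2, FP=1, FN=1, TN=1
p, r, f1, acc = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

With these counts, precision, recall, and F1 all evaluate to 2/3 and accuracy to 3/5, matching the formulas term by term.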

4.4 Performance evaluation

To answer the three research questions, three feature sets were tested. The first set consists of only tweet content-based features, the second set consists of only author profile-based features and the third set is a proposed combination of both sets. The performance of each feature set is compared for each algorithm.

RQ1: Whether a tweet will be retweeted or not?

RQ1 is a binary-choice question. The reason for choosing binary labels is that, in a random sample of tweets, 45% to 50% of tweets do not get any retweet [25]. The binary label helps to split the tweets into two classes, which reduces the number of tweets for the further analysis of predicting how many retweets a tweet can get. Two algorithms have been used for this task: logistic regression and logistic model trees. The results are given in Figs. 6 and 7 and Tables 6 and 7. All three feature sets were able to predict with very high accuracy, with a small improvement visible in the results from tweet content features to author features to the combined features. The answer to the first research question is yes: it is possible to predict accurately whether a tweet will get a retweet or not.

Fig. 6
figure 6

AUC and PR Curve of Logistic Regression for RQ1. (a) LR: Tweet Content features. (b) LR: Author Profile features. (c) LR: Proposed Combined features

Fig. 7
figure 7

AUC and PR Curve of Logistic Model Trees for RQ1. (a) LMT: Tweet Content features. (b) LMT: Author Profile features. (c) LMT: Proposed Combined features

Table 6 Performance Comparison part 1 for RQ1
Table 7 Performance Comparison part 2 for RQ1

RQ2: Predict the exact retweet count for a tweet.

Regression analysis is performed to determine the exact retweet count for a random tweet. The results from the regression algorithms are given in Fig. 8 and Table 8. They indicate that author features performed better than tweet features, and the combined features gave the best performance of all. The R-squared and RMSE values of every regression algorithm are plotted in Fig. 8. The Random Forest and Decision Tree regressors performed best among them, yet all the algorithms produce poor results. This indicates that these features are not a good choice for answering this research question; hence, the answer to the second research question is that prediction of the exact number of retweets is not possible. These features can be combined with other features in future studies for exploratory analysis.

Fig 8
figure 8

Regression Analysis for RQ2. (a) R-squared value comparison. (b) RMSE value comparison

Table 8 Performance Comparison of Regression Analysis for RQ2

RQ3: Categorize tweets into multi-label classes.

To classify the tweets into classes based on ranges of retweet count, different classification algorithms were used, and the performance of the three feature sets was tested for different numbers of bins. The binning criteria are given in Table 9. The results are given in Tables 10, 11, 12, 13 and 14.

Table 9 The binning criteria for classification
Table 10 Performance Metrics for Decision Tree Classification
Table 11 Performance Metrics for Random Forest Classification
Table 12 Performance Metrics for Gradient Boosted Tree Classification
Table 13 Performance Metrics for SVM Classification
Table 14 Performance Metrics for KNN Classification
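Since the exact range boundaries of Table 9 are not reproduced here, the sketch below uses assumed power-of-ten class edges purely to illustrate how a retweet count is mapped to one of seven class labels; both the edges and the helper name are hypothetical.

```python
import bisect

# assumed class boundaries (the real criteria are those of Table 9)
BIN_EDGES = [1, 10, 100, 1_000, 10_000, 100_000]

def retweet_class(retweet_count, edges=BIN_EDGES):
    """Map a retweet count to a class label 0..len(edges); class 0 = no retweets."""
    return bisect.bisect_right(edges, retweet_count)
```

With these six edges the labels 0–6 give seven classes, matching the largest setting (bins = 7) evaluated here, and class 0 holds exactly the tweets with no retweets.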

The values of precision, recall, F1-score, and accuracy are plotted. In terms of accuracy, all three feature sets score above 0.8 when the number of classes is less than 4. As the number of bins/classes increases beyond that, a steady decline in performance is visible. At the highest number of bins (bins = 7), tweet features score below 0.6 in accuracy, whereas author features and the proposed combined features score above 0.6.

The F1-score is plotted for all the classification algorithms and all values of bins; the results are shown in Fig. 9. The best performing algorithm is Random Forest, with an F1-score always greater than 0.7 for author features and combined features. After that, Gradient Boosted Tree performed better than the Decision Tree classifier.

Fig. 9
figure 9

Performance comparison of Classification algorithms for RQ3. (a) Decision Tree. (b) Gradient Boosted Tree. (c) Random Forest. (d) SVM. (e) KNN

An interesting observation is that as the number of classes increases, author features perform very close to the combined features. This pattern suggests that for a large number of classes/bins, the author features can be used instead of the combined features, which reduces the total number of features required and also the complexity of the system. The classification algorithms have shown promising results. The answer to the third research question is that it is possible to categorize tweets into different classes; however, the tradeoff between accuracy and the number of classes should be considered, as expressed in Eq. 15.

$$ Accuracy\propto \frac{1}{Number\ of\ Bins(Classes)} $$
(15)

4.5 Comparison with other works

The comparison of the proposed work with other works is given in Table 15. The highlight of the proposed work is that the proposed feature set has low implementation complexity.

Table 15 Comparison with recent works on information sharing techniques

5 Conclusions and future work

In this paper, an attempt is made to understand the point of view of a user as an information processing node and the role of user profiles on Twitter in predicting retweets. The criteria of using only the Twitter API as the data source and a small number of features provide a unique way of looking at the problem of retweet prediction. The Twitter API is the most common method of data collection from Twitter, which makes it a natural choice for creating reproducible research.

Manually coded features, or new features created using complex algorithms, reduce the chances of scaling up and of replication in other scenarios. A recent study [10] found that positive sentiment results in more retweets during natural disasters, whereas previous studies [29] found that negative sentiment increased retweets during election campaigns. In two different domains, the same feature produced opposite outcomes. This is an example of why some complex features are not suitable for domain-independent, very large-scale, fast data processing.

The contribution of this paper is the effort of reducing the complexity and computational requirements of big data analysis of social media data. Using only numerical features is a very fast, scalable, and feasible solution. Two of the three types of features related to retweet prediction are available in the Twitter API, and of these, author features proved more significant than tweet content features. The combination of both produced the best results.

Three new features, "Author Total Activity", "Author Total Activity per Year", and "Author Tweets per Year", are easy to compute and useful in capturing the active and passive participation of a user. The ability to scale down spikes in the total activity value is achieved by dividing by the account age in years; the same method scales down spikes in the count of tweets posted. This averaging of the total activity count and the total tweet count over the account age is very useful for users who are not regularly active, and it provides the ability to predict for a random user who is not an influencer or a celebrity. Most state-of-the-art research gives more importance to influential users, but in real-time data analysis every tweet is important and every user profile is useful for accurate prediction of retweets. The proposed features provide better results for every type of user. They also provide an important insight for categorizing accounts as more or less trustworthy, forming a basis for distinguishing genuine users from non-genuine-looking accounts.

The proposed method of retweet prediction can easily predict whether a tweet will be retweeted or not. Predicting the exact number of retweets is not achievable with these assumptions and feature sets, but they can be used with other features to reduce the margin of error. Classifying tweets based on retweet count is possible; however, it is difficult to predict accurately with a large number of classes. Fine-grained classes come with the drawback of poor accuracy, while a small number of classes yields high accuracy but with a very large range under one label, which is practically not useful for multi-class classification.

In future work, the proposed feature sets will be applied to the categorization of user accounts based on activities, the role of a user account as a hub or part of the crowd [17], and the impact of information overload on social media users. The proposed three features will be used for categorizing fake and genuine accounts [32] based on their profile features, and the proposed profile features will also be used for opinion mining, sentiment analysis, and fake account detection.