Twitter accounts are used for a multitude of reasons, including social, commercial, political, religious, and ideological purposes. The wide variety of activities on Twitter may be automated or non-automated. Any serious attempt to explore the nature of the vast amount of information being broadcast over such a medium may depend on identifying a potentially useful set of content features hidden within the data. This paper proposes a set of content features that may be promising in efforts to categorize social media activities, with the goal of creating predictive models that will classify or estimate the probabilities of automated behavior given certain account content history. Suggestions for future work are offered.
Keywords: Social media · Content feature extraction
Social media activity data, in the case of this paper Twitter account activity, can be understood as consisting of two primary components: metadata (demographics) and content data. Metadata comprises external characteristics such as time of activity, time of account creation, location, type of platform used for activity, number of friends, number of followers, and more. Content data involves syntactic and semantic characteristics. The focus of this paper is on content data, in particular, content feature extraction that can be implemented on a large set of text data in order to enable categorization of types of activities and classification of activities as automated versus non-automated.
1.2 The Content Data Elements and Their Encoding
Below are some linguistic features that can be extracted from the text content generated by Twitter users. These features can be used to generate mathematical “signatures” for different types of online behaviors. In this way, they augment account demographic features to create a rich, high-fidelity information space for behavior mining and modeling.
The relative size and diversity of the account vocabulary
Content generated by automated means tends to reuse complex terms, while naturally generated content has a more varied vocabulary, and terms reused are generally simpler.
The word length mean and variance
Naturally generated content tends to use shorter but more varied language than automatically generated content.
The presence/percentage of chat-speak
Casual, social users often employ simple, easy-to-generate graphical icons called emoticons. Sophisticated, non-social users tend to avoid these unsophisticated graphical icons.
The presence and frequency of hashtags
Hashtags are essentially topic words. Several hashtags taken together amount to a tweet “gist”. A table of these could be used for automated topic/content identification and categorization.
The number of misspelled words
It is assumed that sophisticated content generators, such as major retailers, will have a very low incidence of misspellings relative to casual users who are typing on a small device like a phone or tablet.
The presence of vulgarity
Major retailers are assumed to be unlikely to embed vulgarity in their content.
The use of hot-button words and phrases (“act now”, “enter to win”, etc.)
Marketing “code words” are regularly used to communicate complex ideas to potential customers in just a few words. Such phrases are useful precisely because they are hackneyed.
The use of words rarely used by other accounts (e.g., tf-idf scores)
Marketing campaigns often coin words around their products. These coined words occur nowhere else, and so will have high tf-idf (term frequency–inverse document frequency) scores.
The presence of URLs
To make a direct sale through a tweet, the customer must be engaged and directed to a location where a sale can be made. This is most easily accomplished by supplying a URL. URLs, even shortened ("tiny") URLs, can be automatically followed to facilitate screen scraping for identification/characterization.
The generation of redundant content (same tweets repeated multiple times)
It is costly and difficult to generate unique content for each of thousands of online recipients. Therefore, automated content (e.g., advertising) tends to have a relatively small number of stylized units of content that they use over and over. The result is an account with “redundant” content.
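The rare-word feature above can be computed with nothing more than term counts. The following is a minimal sketch; the tiny corpus and the coined term "glowfizz" are invented for illustration, and the paper does not specify its own tf-idf implementation:

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """tf-idf(t, d) = tf(t, d) * log(N / df(t)): term frequency weighted by
    how rare the term is across the N documents."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

# "glowfizz" is an invented coined marketing term; it appears in only one
# account's content, so it scores high there, while "today", present in
# every document, scores exactly 0.
docs = [
    "try glowfizz today the best drink",
    "the weather today is nice",
    "the game today was fun",
]
scores = tfidf_scores(docs)
```

A production system would normally use a smoothed idf and sublinear tf, but the raw form above is enough to separate coined product words from everyday vocabulary.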
Content data (tweets) are returned (in the JSON structure) as character strings of 1 to 140 characters. They may be in any language or no language at all. Tweets can contain any combination of free text, emoticons, chat-speak, hashtags, and URLs. Twitter does not filter tweets for content (e.g., vulgarisms, hate speech).
For this study a sample of the activities of 8845 Twitter accounts containing the content of 1,048,395 tweets was collected for content analysis.
A vector of text features is derived for each user. This is accomplished by deriving text features for each of the user’s tweets and then rolling them up, i.e. summing and normalizing the data. Therefore, one content feature vector is derived for each user from all of that user’s tweets.
The extraction of numeric features from text is a multi-step process:
1. Collect the user's most recent (up to 200) tweet strings into a single set (a Thread).
2. Convert the thread text to upper case for term matching.
3. Scan the thread for the presence of emoticons, chat-speak, hashtags, URLs, and vulgarisms, setting bits to indicate the presence/absence of each of these text artifacts.
4. Remove special characters from the thread to facilitate term matching.
5. Create a Redundancy Score for the Thread. This is done by computing and rolling up (sum and normalize) the pairwise similarities of the tweet strings within the thread using six metrics: Euclidean distance, RMS distance, L1 distance, L-infinity distance, cosine distance, and the norm-weighted average of the five distances.
The thread text feature vector then contains as vector components user scores based on features such as the emoticon flag, the chat-speak flag, the hashtag flag, the URL flag, the vulgarity flag, and the Redundancy score.
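The Redundancy Score step can be sketched as follows. The paper does not give exact formulas, so the term-count vectorization, the plain (rather than norm-weighted) averaging of the five distances, and the mapping of mean distance to a score in (0, 1] are all assumptions made for illustration:

```python
import math
from collections import Counter
from itertools import combinations

def count_vectors(tweets):
    """Map each tweet to a term-count vector over the thread vocabulary."""
    vocab = sorted({w for t in tweets for w in t.upper().split()})
    counts = [Counter(t.upper().split()) for t in tweets]
    return [[c[w] for w in vocab] for c in counts]

def pair_distances(u, v):
    """The five distances named in the text plus their plain average
    (the paper uses a norm-weighted average; a simple mean stands in here)."""
    diffs = [a - b for a, b in zip(u, v)]
    euclid = math.sqrt(sum(d * d for d in diffs))
    rms = euclid / math.sqrt(len(diffs))
    l1 = sum(abs(d) for d in diffs)
    linf = max(abs(d) for d in diffs)
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    dot = sum(a * b for a, b in zip(u, v))
    cosine = 1.0 - (dot / (nu * nv) if nu and nv else 0.0)
    dists = [euclid, rms, l1, linf, cosine]
    return dists + [sum(dists) / len(dists)]

def redundancy_score(tweets):
    """Roll up the averaged pairwise distances; identical tweets give
    distance 0, which this mapping turns into the maximum score 1.0."""
    vecs = count_vectors(tweets)
    pairs = list(combinations(vecs, 2))
    if not pairs:
        return 0.0
    mean_d = sum(pair_distances(u, v)[-1] for u, v in pairs) / len(pairs)
    return 1.0 / (1.0 + mean_d)
```

Under this mapping a thread of identical tweets scores exactly 1.0, and increasingly varied threads fall toward 0, matching the intuition that bot accounts repeat stylized content.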
A list of 23 potential content related features was created and calculated for each of the 8845 Twitter accounts in the sample (Tables 1 and 2).
For the purpose of classifying accounts as automated (bots) versus non-automated, a sample of tweet content from 101 active accounts was rated manually. The sample was divided into 5 subsets, each containing approximately 20 accounts and a few thousand tweets, and each subset was rated by multiple volunteers who read its content. Rating an account involved classifying it as a bot or not, assigning a level of confidence to that classification, and giving a brief explanation of the main reasons for the decision. Of the 101 accounts, 65 were classified with a high level of confidence: 35 as bot accounts and 30 as non-bot accounts. Those 65 accounts were then assigned a dependent variable value of 1 if identified as a bot, and 0 otherwise.
Excel was used to generate a correlation matrix for the 23 content features for the large sample of 8845 feature vectors (Table 3).
Similarly, correlations between the 23 content features and the dependent variable for the small set of 65 accounts were calculated and sorted based on absolute value (Table 5).
Absolute values of the correlations between features and the dependent variable ranged from 0.003 to 0.603. Ranking such absolute values of correlations resulted in the following list of top predictors of bot-like behavior: “redund”, “urls”, “good_len”, “adj”, “tweets”, “vulgar”, “good_cnt”, “commnoun”, “emo_chat” and “art”.
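This ranking by absolute correlation with the bot/non-bot label can be reproduced with plain Pearson correlations. The sketch below uses invented toy data (six accounts, two features) rather than the study's 65 labeled accounts:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(features, target):
    """Sort feature names by |corr(feature, target)|, descending."""
    return sorted(features,
                  key=lambda name: abs(pearson(features[name], target)),
                  reverse=True)

# Invented toy data standing in for the labeled accounts: "redund"
# tracks the bot label closely, "noise" does not.
target = [1, 1, 1, 0, 0, 0]                       # 1 = bot, 0 = non-bot
features = {
    "redund": [0.9, 0.8, 0.95, 0.1, 0.2, 0.15],
    "noise": [0.2, 0.9, 0.1, 0.8, 0.3, 0.7],
}
ranking = rank_features(features, target)
```

With a binary dependent variable this is the point-biserial correlation, which is exactly the Pearson formula applied to a 0/1 target.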
Charts were created to examine the distributions of features that were deemed to be significant in terms of their correlation with the dependent variable in the small sample. Charts were created to examine joint distributions. Following some interpretation of the nature of distributions, some hypotheses were made as to potential statistical learning tools that may be useful in modeling based on such content features (Figs. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and 11).
Approximately 10% of the 8845 accounts had the maximum level of activity measured (200 tweets). This may provide some lower bound estimate of the rate of accounts exhibiting bot-like behavior.
Examination of the content features correlation matrix reveals that correlations are generally low with some explainable exceptions. Features such as good_len and good_cnt refer to the number of characters that are part of correctly spelled words and the number of correctly spelled words, respectively. The high correlation of 0.86 is to be expected, and such is the case for bad_len and bad_cnt with a correlation of 0.841 (both highlighted in Table 4). In both situations, consideration may be given to selecting only one of each pair for the purpose of predictive modeling.
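Dropping one feature of each highly correlated pair, as suggested above, can be done with a simple greedy filter. The feature values below are invented; only the near-proportionality of good_len and good_cnt mirrors the 0.86 correlation reported:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def prune_correlated(features, threshold=0.8):
    """Greedily keep features in order; drop any feature whose absolute
    correlation with an already-kept feature exceeds the threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# good_cnt is roughly a scaled copy of good_len (character count of
# correctly spelled words vs. their word count), so only one survives.
features = {
    "good_len": [40, 55, 12, 70, 33],
    "good_cnt": [8, 11, 3, 14, 7],
    "redund":   [0.2, 0.9, 0.4, 0.1, 0.6],
}
kept = prune_correlated(features)
```

The 0.8 threshold is an assumption chosen for illustration; the paper simply notes that one feature of each of the two highlighted pairs could be set aside before modeling.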
The top ten content features appear to contain discriminating information that may be relevant in an attempt to classify Twitter accounts as bot or non-bot accounts. Separation issues and the skewed nature of the majority of the distributions of content features may justify an expectation that a nonparametric approach may perform better than a parametric one.
The distribution of the redundancy scores appears to be approximately normal, while all other distributions examined are skewed. As in the case of an earlier study of external features, most relevant distributions that quantify social media behaviors do not appear to be normal, a fact that may later support preference for nonparametric modeling techniques or the application of some feature transformations.
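One such feature transformation for skewed distributions is a log transform. The sketch below uses invented activity counts; the paper does not prescribe a specific transformation, and log1p is shown only as a common choice for right-skewed count data:

```python
import math

def skewness(xs):
    """Sample skewness: mean cubed deviation divided by std cubed."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    if var == 0:
        return 0.0
    return (sum((x - m) ** 3 for x in xs) / n) / var ** 1.5

# Invented activity counts with a heavy right tail (a few very active
# accounts); log1p compresses the tail and pulls the skewness toward zero.
counts = [1, 2, 2, 3, 3, 4, 5, 8, 40, 200]
transformed = [math.log1p(x) for x in counts]
```

After the transform the distribution is far closer to symmetric, which would make parametric techniques less objectionable if they were preferred over nonparametric ones.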
Examination of the scatter plots of joint distributions seems to support the selection of the top content features listed above. One can note that in the case of vulgarity score there is no presence of vulgarity among the bot accounts, while non-bot accounts may or may not include vulgar language.
Taking all this into account, a starting set of content features that may be selected for modeling may involve the following nine features: redund, urls, good_len, adj, tweets, vulgar, commnoun, art, emo_chat.
A number of significant limitations must be noted.
First, the data set may not be a representative sample of the current state of affairs when it comes to bot versus non-bot activity in the Twitter medium.
Second, the process of manually classifying a small set of accounts and reaching a consensus in roughly two-thirds of the cases may not be without errors.
Third, a larger sample set from the manual classification process may lead to different conclusions about content features and the type of modeling that may be expected to perform best.
Fourth, concentrating on content, which probably provides the most predictive power, may still ignore some critical external features, and thus may not produce an optimal perspective.
4.3 Further Investigations
Future work may attempt to consider a mix of external features and content features, calculated on a large set of known bot and non-bot accounts for better feature selection, description, and classification. This should enable a much more reliable subset of predictive or discriminating features, which in turn may lead to more reliable descriptive and predictive models.
This paper demonstrates one way by which content of social media activities may be processed in terms of mathematical “signatures” of different types of online behaviors that may be used for descriptive and predictive modeling of automated versus non-automated activities.
Alarifi, A., Alsaleh, M., Al-Salman, A.: Twitter turing test: identifying social machines. Inf. Sci. 372, 332–346 (2016). doi:10.1016/j.ins.2016.08.036
Carapinha, F., et al.: Modeling of social media behaviors using only account metadata. In: Schmorrow, D.D., Fidopiastis, C.M. (eds.) AC 2016. LNCS (LNAI), vol. 9744, pp. 393–401. Springer, Cham (2016). doi:10.1007/978-3-319-39952-2_38
Chu, Z., Gianvecchio, S., Jajodia, S., Wang, H.: Detecting automation of Twitter accounts: are you a human, bot, or cyborg? IEEE Trans. Dependable Sec. Comput. 9, 811–824 (2012)
Dickerson, J.P., Kagan, V., Subrahmanian, V.: Using sentiment to detect bots on Twitter: are humans more opinionated than bots? In: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014) (2014). doi:10.1109/asonam.2014.6921650
A framework for twitter bot analysis. In: Proceedings of the 25th International Conference Companion on World Wide Web - WWW 2016 Companion (2016). doi:10.1145/2872518.2889360
Main, W., Shekokhar, N.: Twitterati Identification System (2015). http://www.sciencedirect.com/science/article/pii/S1877050915003129. Accessed 29 Jan 2017
Hancock, M.: Automating the characterization of social media culture, social context, and mood. In: 2014 Science of Multi-Intelligence Conference (SOMI), Chantilly, VA (2014)
Hancock, M., Sessions, C., Lo, C., Rajwani, S., Kresses, E., Bleasdale, C., Strohschein, D.: Stability of a type of cross-cultural emotion modeling in social media. In: Schmorrow, D.D., Fidopiastis, C.M. (eds.) AC 2015. LNCS (LNAI), vol. 9183, pp. 410–417. Springer, Cham (2015). doi:10.1007/978-3-319-20816-9_39
Neumann, S. et al. (2017). Content Feature Extraction in the Context of Social Media Behavior. In: Schmorrow, D., Fidopiastis, C. (eds) Augmented Cognition. Neurocognition and Machine Learning. AC 2017. Lecture Notes in Computer Science(), vol 10284. Springer, Cham. https://doi.org/10.1007/978-3-319-58628-1_42