Profiling Bot Accounts Mentioning COVID-19 Publications on Twitter
This paper presents preliminary findings on automated bots mentioning scientific publications about COVID-19 on Twitter. A quantitative approach was adopted to characterize the social and posting patterns of bots, in contrast to other users, in Twitter scholarly communication. Our findings indicate that bots play a prominent role in research dissemination and discussion on the social web. We observed that 0.45% of users in our sample were explicit bots, producing 2.9% of tweets. The results indicate that bots tweeted differently from non-bot accounts in terms of the volume and frequency of tweeting, the way they handled tweet content, and their preferences in article selection. Meanwhile, their behavioral patterns may not be the same as those of Twitter bots in other contexts. This study contributes to the literature by enriching the understanding of automated accounts in the process of scholarly communication and demonstrating the potential of bot-related studies in altmetrics research.
Keywords: Twitter · Bot · Network analysis · Altmetrics research
1 Introduction
The rapid development of technology has expanded the concept of scholarly communication beyond academic publishing to include informal and interactive research dissemination and discussion on the social web. Meanwhile, altmetrics, metrics that capture the attention a scholarly work receives on online platforms, have emerged as a supplement to traditional bibliometrics in assessing the broader impact of research.
As one of the primary social media platforms used among scientists and researchers, Twitter is a major source of altmetrics. Researchers have recognized its potential for tracing fast-paced conversations about academic literature. However, because of Twitter's vulnerability to bot activities, the validity of Twitter metrics in assessing research impact has been questioned by academic communities. Even though bots have been prevalently observed in current studies of Twitter metrics [7, 8, 10], bots in the context of scholarly communication are still understudied.
To examine the implications of bot accounts, it is critical to understand their behavioral patterns in the process of research dissemination and scholarly communication. Building upon existing scholarship, this paper serves as a preliminary study profiling bot accounts that tweet scientific publications on Twitter. Taking recent COVID-19 publications as a case study, it aims to observe how bots, as well as other users, react to the latest scientific literature on trending topics on Twitter. First, we present a review of related works, followed by an exploratory analysis of the social and posting patterns of bots in contrast to non-bot accounts. Next, we make further comparisons between bots and non-bot accounts by characterizing the articles they tweeted. In the concluding section, we discuss directions for future research based on our preliminary findings.
2 Related Works
Bot activities are evident in scholarly communication on Twitter. Existing studies found that a considerable proportion of the most productive users tweeting scientific publications and generating a large volume of tweets are automated accounts. For instance, Robinson-Garcia et al. reported that half of the top 25 Twitter users mentioning microbiology articles were bots, contributing 4% of the tweets in their sample. In Haustein's study, 15 of the top 19 users citing academic articles across disciplines were self-identified bot accounts, each of which posted over 25,000 tweets on average. Another study, examining large-scale altmetric data, found that the discrepancy between the number of posts and the number of unique users can exceed 30,000, which the researchers attributed to excessive bot activity.
It is commonly believed that bots and human accounts behave differently. Haustein observed that bots, specifically self-identified bots, are more engaged in tweeting scientific literature, as reflected in their higher volume and frequency of tweeting. Additionally, they may have a shorter tweet span, i.e., the number of days between their first and last tweet. The researcher added that bots tweeting scholarly work may not share the patterns of other Twitter bots, e.g., social bots in a generic context.
The issue of bots, specifically their extent, has attracted considerable attention from researchers. However, only a few studies have attempted to address the implications of bots for Twitter metrics and online academic communication. For instance, studying Twitter users with “arXiv” in their user names, handles, or Twitter bios, researchers found that over 80% of the accounts in their sample were automated platform feeds that push publication updates from arXiv, or topic feeds, i.e., automated feeds of publications relevant to a certain topic. Due to the homogeneous nature of bot accounts, Haustein and her colleagues suggested that automated tweets, whether from good bots or bad bots, may not imply impact but rather reflect diffusion. Adopting a network approach, Aljohani et al. demonstrated the significant role of bots in affecting the spread of desired content in the altmetrics Twitter social network (ATSN). For example, bots were observed to be extensively used for research dissemination.
It was also identified that the degree distribution and community size distribution of an ATSN with a prevalent presence of bots tend to follow a power-law distribution .
There is still insufficient discussion of whether we should, and how we can, tackle the issue of bots to enhance the validity of Twitter metrics as alternative research impact indicators, for example by identifying or eliminating bot accounts and bot-generated content. To fill this research gap and facilitate the discussion, it is important to understand the role of bots in the process of scholarly communication on Twitter. A first step toward this goal is characterizing the behavioral patterns of bot accounts in relevant activities.
3.1 Data Collection
First, reusing the query string constructed by Kousha and Thelwall, COVID-19 publications were retrieved from Scopus. To trace Twitter users' reactions to the latest publications, we narrowed the search results to English-language journal articles published in May 2020. To examine the characteristics of articles, articles without source title information were omitted. Article DOIs were used to extract Twitter mentions from Altmetric.com. As our ultimate plan was to analyze the complete Twitter social networks at the article level, articles with fewer than 10 or over 100,000 Twitter mentions were excluded. We cross-checked Scopus, Altmetric.com, and the Crossref API to retrieve the date when each article was first made available. In total, 417 articles from a variety of research areas (health sciences: 69.96%, life sciences: 17.08%, social sciences & humanities: 5.14%, physical sciences: 4.94%, multidisciplinary: 2.88%) were retrieved. According to statistics provided by Altmetric.com, these articles had been mentioned in 153,098 tweets by 100,620 unique users as of the date of data collection, June 22, 2020.
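The article-selection window described above can be sketched in Python. The Altmetric.com counts endpoint (`https://api.altmetric.com/v1/doi/{doi}`) is a real public endpoint, but the helper names and the assumption that mention counts are available as plain integers are ours, as the study does not publish its pipeline:

```python
# Sketch of the article-filtering step, assuming Twitter mention counts have
# already been fetched per DOI. The URL builder targets the public
# Altmetric.com counts API; no API key handling or network call is shown.

def altmetric_url(doi: str) -> str:
    """Build the Altmetric.com counts-API URL for a given DOI."""
    return f"https://api.altmetric.com/v1/doi/{doi}"

def keep_article(twitter_mentions: int, lo: int = 10, hi: int = 100_000) -> bool:
    """Keep an article only if its Twitter mention count lies inside the
    study's inclusion window (at least 10 and at most 100,000 mentions)."""
    return lo <= twitter_mentions <= hi
```

Articles passing `keep_article` would then be cross-checked against Scopus and the Crossref API for their first-availability dates.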
Using the Twitter API, we further collected information about the retrieved tweets and their authors. As some tweets and user accounts were no longer active, 89,258 user profiles and 139,298 unique tweets with available user profiles were retrieved from the Twitter API; 131 tweets mentioned multiple articles. The analysis in this paper is based on this matched dataset.
3.2 Identifying Explicit Bot Accounts
Twitter bots are software agents designed to autonomously perform Twitter activities, such as tweeting, retweeting, following, and replying, via the Twitter API without human judgment and selection [13, 14]. This preliminary study covers only explicit bots, namely 1) self-identified bots and 2) possible spambots.
First, the following keyword query was matched against user descriptions to identify self-identified bots:

((“bot” | “robot” | “tweetbot” | “tweet bot” | “twitterbot” | “tweeter bot” | “a *robot”) & NOT (“bot hate” | “bot sniper” | “block *bot” | “not a *bot” | “nor a *bot” | “neither a *bot” | “like a *bot” | “sometimes*bot” | “think i am a bot” | “roboti*”)) | (“automat*” & (“alert” | “update” | “feed” | “link” | “news” | “stream” | “script” | “tweet”) & NOT (“no* auto”)) | ((“article” | “literature” | “paper” | “peer-review*” | “preprint” | “publications” | “pubmed” | “arxiv” | “biorxiv” | “medrxiv”) & (“alert” | “update” | “feed” | “links” | “stream”) & NOT (“editor” | “journalist” | “official” | “feeding” | “links between”)) | ((“created by” | “developed by” | “programmed”) & (“share” | “daily” | “latest” | “news” | “podcast”)) | (“aggregator” | “news feed” | “datafeed” | “new submissions” | “latest publications” | “new publication” | “daily updates”)
A second, broader keyword list was also applied:

(“bot” & NOT “tweetbot for *” & NOT “roboti*”) | (“paper” & NOT (“paper.li” | “instapaper”)) | “retweet” | “update” | “alert” | “auto” | “curat*” | “aggregat*” | “combinator” | “feed” | “arxiv” | “biorxiv” | “medrxiv” | “journal” | “article” | “preprint” | “RT”
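Such include/exclude keyword queries can be applied to user profiles as in the minimal sketch below. The patterns are a small illustrative subset of the full query, not the exact implementation used in the study:

```python
import re

# Illustrative subset of the bot query: a bio is flagged when it matches any
# INCLUDE pattern and no EXCLUDE pattern. The study's full query combines
# several such groups; these patterns are examples, not the complete set.
INCLUDE = [
    r"\btwitter ?bot\b",
    r"\btweet ?bot\b",
    r"\bautomat\w*\b.*\b(alert|update|feed|tweet)s?\b",
]
EXCLUDE = [
    r"\bnot a \w*bot\b",
    r"\bbot hate\b",
]

def is_explicit_bot_bio(bio: str) -> bool:
    """Return True when a user bio matches the (simplified) bot query."""
    text = bio.lower()
    if any(re.search(p, text) for p in EXCLUDE):
        return False
    return any(re.search(p, text) for p in INCLUDE)
```

For example, a bio like “Automated feed of new COVID-19 papers” would be flagged, while “I'm not a robot, just a scientist” would be excluded by the negative patterns.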
Next, we extracted a list of potential spambots by identifying accounts that had posted identical content more than three times.
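This duplicate-content rule can be expressed as a small helper; the `(user, text)` pair representation of tweets is our assumption about how the data is stored, not the study's actual code:

```python
from collections import Counter, defaultdict

def potential_spambots(tweets, threshold=3):
    """Flag accounts that posted identical tweet text more than `threshold`
    times, mirroring the duplicate-content rule described above.
    `tweets` is an iterable of (user_id, tweet_text) pairs."""
    per_user = defaultdict(Counter)
    for user, text in tweets:
        per_user[user][text] += 1
    return {user for user, counts in per_user.items()
            if max(counts.values()) > threshold}
```

An account that posted the same link four times would be flagged, while one that posted varied content would not.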
Lastly, we manually checked the profiles of the identified bot accounts and removed selections that were clearly wrong, e.g., “bot” as part of the user's real name or word combinations not covered by the query string, such as “bot killer”, “blocked by bot”, and “not a creeper or a robot”.
As a result, 400 explicit bot accounts were identified, accounting for 0.45% of the users in our dataset. It is worth highlighting that these accounts contributed around 2.9% of all selected tweets citing COVID-19 publications on Twitter. To facilitate the comparison of patterns between explicit bot accounts (bots) and non-explicit-bot accounts (non-bots), we randomly selected 400 accounts from the unclassified users. Similarly, we read through their user descriptions to ensure that no explicit bot was included.
A major limitation of our method is that we were unable to measure the performance of the classification, and the proportion of bot accounts was likely underestimated. On the one hand, only explicit bots were covered, while less explicit or more intelligent bots remain unclassified. On the other hand, as the set of predefined keywords is not exhaustive, the above bot identification strategy may not achieve an ideal recall rate. It is also possible that cyborgs, i.e., bot-assisted humans or human-assisted bots, fell under the category of bot accounts if they exhibited a high level of automation in the sample data. However, with basic manual validation as a measure to enhance precision, our study can still serve its purpose as a preliminary study capturing patterns of bot accounts in Twitter scholarly communication.
3.3 Data Analysis
The data analysis comprised two parts. First, the social and tweeting patterns of bots were studied in contrast to those of non-bots in our sample. Second, we characterized the articles tweeted by both groups. Statistical tests, such as Mann-Whitney U tests and chi-squared tests, were performed to compare differences between bots and non-bots using SciPy in Python.
Selected features for the bots and non-bots comparison:
- Social patterns: account age (days), length of user bio (with URLs removed), number of statuses, favorites-statuses ratio, followers-friends ratio, number of listed
- Tweeting patterns: number of tweets, average number of tweets per article, average responding time to publications, average number of tweets per day, retweet-tweet ratio, number of hashtags per tweet, number of @mentions per tweet
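The per-feature comparison relies on Mann-Whitney U tests; the study used SciPy (`scipy.stats.mannwhitneyu`), and as a sketch of what that test computes, a minimal pure-Python version of the U statistic with tie-averaged ranks is:

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for sample x versus sample y, computed from
    tie-averaged ranks. A simplified stand-in for scipy.stats.mannwhitneyu,
    which additionally returns a p-value."""
    combined = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j + 1) / 2  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[combined[k][1]] = avg_rank
        i = j
    n1 = len(x)
    r1 = sum(ranks[:n1])            # rank sum of the first sample
    return r1 - n1 * (n1 + 1) / 2   # U statistic for sample x
```

Each feature in the list above (e.g., followers-friends ratio) would be compared by passing the bot values as `x` and the non-bot values as `y`.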
To better understand bot accounts' tweeting patterns when mentioning scientific publications on Twitter, we also characterized the articles tweeted by bots and non-bots. The characteristics analyzed include each article's open access status, the impact of its source title, and its subject area.
Bots and non-bots may also have different preferences when tweeting academic articles. For instance, bots were more likely to mention open access (OA) articles than non-bots, with 96.6% and 95% of tweets mentioning OA articles respectively, χ2 = 4.92, p < .05. Though statistical significance was observed in the chi-squared test comparing the tweet distributions of bots and non-bots across SJR quartiles, χ2 = 11.11, p < .05, both groups strongly preferred articles from high-impact journals: over 92.5% of non-bot tweets mentioned articles from Q1 journals, while this percentage among bots was also high, at 90.8%. Regarding the disciplines of the articles tweeted, as COVID-19 is considered a health crisis, both groups paid great attention to articles in the medical and health sciences, though non-bots showed a wider variety of interest, including articles interpreting COVID-19 from social science, public health, and multidisciplinary perspectives.
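The OA comparison above amounts to a chi-squared test on a 2x2 table of tweet counts (bots vs. non-bots by OA vs. non-OA). A minimal Pearson chi-squared statistic for a 2x2 table, without continuity correction, is shown below; any counts used with it would be hypothetical, since the paper reports only percentages:

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 contingency table
    [[a, b], [c, d]], e.g. rows = {bots, non-bots} and
    columns = {OA tweets, non-OA tweets}. No continuity correction."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        expected = row * col / n   # expected count under independence
        stat += (obs - expected) ** 2 / expected
    return stat
```

In practice, `scipy.stats.chi2_contingency` (with `correction=False`) yields the same statistic along with the p-value and expected frequencies.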
5 Discussion and Conclusion
Consistent with existing studies, our findings indicate a prevalence of bot-generated tweets. In general, bot accounts are younger than non-bots and tweeted more, and more frequently. As observed, a considerable proportion of them were created to share news and articles about COVID-19. Though bots did not employ hashtags and @mentions excessively in terms of the rate of relevant tweets, they added a slightly larger number of hashtags and @mentions than non-bots. Second, bots mentioning scientific publications tweeted differently from both non-bots and bots in a generic context. For instance, bots in our sample retweeted less than non-bots and had a relatively higher followers-friends ratio. In addition, bots and non-bots do not seem to share common content selection criteria, though both prefer articles from high-impact journals.
To summarize, the potential of bots to affect the validity of Twitter metrics in assessing research should be recognized. Sharing features with spambots, the bots in our sample generated excessive Twitter activity mentioning academic works. Moreover, with a relatively higher followers-friends ratio and active use of hashtags and @mentions, bots may have the power to affect the process of communication. Additionally, automated algorithms that select articles based on specific criteria, e.g., the research discipline of the source title or journal impact factors, may bias Twitter metrics. However, it is encouraging to observe original tweets, likely sourced from external platforms, among automated bots. In other words, bots may serve as idea starters and contribute positively to scholarly communication on the social web.
A major limitation of this study is that we did not have a manually labeled dataset, and the dataset used for analysis was relatively small. Moreover, without in-depth text analysis and more granular account classification, we cannot tell whether the implications of automated accounts are positive or negative. Also, as we focused on publications related to COVID-19, a trending topic on Twitter, our findings may not generalize to publications on other topics. However, this study demonstrates the potential, as well as the necessity, of bot-related studies in the context of online scholarly communication. With larger-scale data and more sophisticated bot detection techniques, we will be able to present more rigorous findings regarding the implications of automated bots. A comprehensive understanding of the role of bots in the process of scholarly communication will help us further assess Twitter metrics and identify possible solutions to enhance their validity as research impact indicators.
References
- 4.Hassan, S.-U., Imran, M., Gillani, U., Aljohani, N.R., Bowman, T.D., Didegah, F.: Measuring social media activity of scientific literature: an exhaustive comparison of Scopus and novel altmetrics big data. Scientometrics 113(2), 1037–1057 (2017). https://doi.org/10.1007/s11192-017-2512-x
- 5.Darling, E., Shiffman, D., Côté, I., Drew, J.: The role of Twitter in the life cycle of a scientific publication. Ideas Ecol. Evol. 6 (2013). https://doi.org/10.4033/iee.2013.6.6.f
- 7.Robinson-Garcia, N., Arroyo-Machado, W., Torres-Salinas, D.: Mapping social media attention in Microbiology: identifying main topics and actors. FEMS Microbiol. Lett. 366 (2019). https://doi.org/10.1093/femsle/fnz075
- 8.Haustein, S.: Scholarly Twitter metrics. In: Glänzel, W., Moed, H.F., Schmoch, U., Thelwall, M. (eds.) Handbook of Quantitative Science and Technology Research (2018). https://arxiv.org/abs/1806.02201
- 9.Haustein, S., Bowman, T.D., Holmberg, K., Tsou, A., Sugimoto, C.R., Larivière, V.: Tweets as impact indicators: examining the implications of automated “bot” accounts on Twitter. J. Assoc. Inf. Sci. Technol. (2016). https://doi.org/10.1002/asi.23456
- 11.Haustein, S., Toupin, R., Alperin, J.P.: “Not sure if scientist or just Twitter bot” Or: who tweets about scholarly papers (2018). https://www.altmetric.com/blog/not-sure-if-scientist-or-just-twitter-bot-or-who-tweets-about-scholarly-papers/
- 13.Kousha, K., Thelwall, M.: COVID-19 publications: database coverage, citations, readers, tweets, news, Facebook walls, Reddit posts. Quant. Sci. Stud. 1–24 (2020). https://doi.org/10.1162/qss_a_00066
- 15.Kantepe, M., Ganiz, M.C.: Preprocessing framework for Twitter bot detection. In: 2017 International Conference on Computer Science and Engineering (UBMK), pp. 630–634. IEEE (2017). https://doi.org/10.1109/UBMK.2017.8093483
- 18.Gilani, Z., Kochmar, E., Crowcroft, J.: Classification of Twitter accounts into automated agents and human users. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, pp. 489–496 (2017). https://doi.org/10.1145/3110025.3110091
- 19.Sedhai, S., Sun, A.: HSpam14: a collection of 14 million tweets for hashtag-oriented spam research. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 223–232 (2015). https://doi.org/10.1145/2766462.2767701