
Profiling Bot Accounts Mentioning COVID-19 Publications on Twitter

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12504)

Abstract

This paper presents preliminary findings regarding automated bots mentioning scientific publications about COVID-19 on Twitter. A quantitative approach was adopted to characterize the social and posting patterns of bots, in contrast to other users, in Twitter scholarly communication. Our findings indicate that bots play a prominent role in research dissemination and discussion on the social web. Explicit bots made up 0.45% of accounts in our sample while producing 2.9% of tweets. The results indicate that bots tweeted differently from non-bot accounts in terms of the volume and frequency of tweeting, the way they handled tweet content, and their preferences in article selection. At the same time, their behavioral patterns may not match those of Twitter bots in other contexts. This study contributes to the literature by enriching the understanding of automated accounts in the process of scholarly communication and demonstrating the potential of bot-related studies in altmetrics research.

Keywords

Twitter · Bot · Network analysis · Altmetrics research

1 Introduction

The rapid development of technology has expanded the concept of scholarly communication beyond academic publishing to include informal and interactive research dissemination and discussion on the social web [1]. Meanwhile, altmetrics, metrics that capture the attention a scholarly work receives on online platforms, have emerged as a supplement to traditional bibliometrics in assessing the broader impact of research [2].

As one of the primary social media platforms used by scientists and researchers [3], Twitter is a major source of altmetrics. Researchers have recognized its potential for tracing fast-paced conversations about academic literature [4]. However, because of Twitter's vulnerability to bot activity, the validity of Twitter metrics in assessing research impact has been questioned by academic communities [5]. Even though bots have been widely observed in studies of Twitter metrics [7, 8, 10], bots in the context of scholarly communication remain understudied.

To examine the implications of bot accounts, it is critical to understand their behavioral patterns in the process of research dissemination and scholarly communication. Building upon existing scholarship, this paper serves as a preliminary study profiling bot accounts that tweet scientific publications. Taking recent COVID-19 publications as a case study, it aims to observe how bots, as well as other users, react to the latest scientific literature on trending topics on Twitter. First, we present a review of related works, followed by an exploratory analysis of the social and posting patterns of bots in contrast to non-bot accounts. Next, we make further comparisons between bots and non-bot accounts by characterizing the articles they tweeted. In the concluding section, we discuss directions for future research based on our preliminary findings.

2 Related Works

Bot activities are evident in scholarly communication on Twitter. Existing studies found that a considerable proportion of the most productive users tweeting scientific publications, generating a large volume of tweets, are automated accounts [6]. For instance, Robinson-Garcia et al. reported that half of the top 25 Twitter users mentioning microbiology articles were bots, contributing 4% of tweets in their sample [7]. In Haustein's study, 15 of the top 19 users citing academic articles across disciplines were self-identified bot accounts, each of which posted over 25,000 tweets on average [8]. Another study, examining large-scale altmetric data, found that the discrepancy between the number of posts and the number of unique users can exceed 30,000, which the researcher attributed to excessive bot activities [10].

It is commonly believed that bots and human accounts behave differently. Haustein [8] observed that bots, specifically self-identified bots, are more engaged in tweeting scientific literature, as reflected in their higher volume and frequency of tweeting. Additionally, they may have a shorter tweet span (the number of days between the first and the last tweet). The researcher added that bots tweeting scholarly work may not share the patterns of other Twitter bots, e.g., social bots in a generic context.

The issue of bots, specifically their extent, has attracted considerable attention from researchers. However, only a few studies have attempted to address the implications of bots for Twitter metrics and online academic communication. For instance, studying Twitter users with "arXiv" in their user names, handles, or Twitter bios, researchers found that over 80% of accounts in their sample were automated platform feeds that push publication updates from arXiv, and topic feeds, i.e., automated feeds of publications relevant to a certain topic [9]. Due to the homogeneous nature of bot accounts, Haustein and her colleagues suggested that automated tweets, regardless of whether they come from good bots or bad bots, may not imply impact but rather reflect diffusion [11]. Adopting a network approach, Aljohani et al. [12] demonstrated the significant role of bots in affecting the spread of desired content in the altmetrics Twitter social network (ATSN). For example, bots were observed to be extensively used for research dissemination.

It was also identified that the degree distribution and community size distribution of an ATSN with a prevalent presence of bots tend to follow a power-law distribution [12].

There is still insufficient discussion of whether we should, and how we can, tackle the issue of bots to enhance the validity of Twitter metrics as alternative research impact indicators, such as by identifying or eliminating bot accounts and bot-generated content. To fill this research gap and facilitate the discussion, it is important to understand the role of bots in the process of scholarly communication on Twitter. A first step toward this is characterizing the behavioral patterns of bot accounts in relevant activities.

3 Method

3.1 Data Collection

First, reusing the query string constructed by Kousha and Thelwall [13], COVID-19 publications were retrieved from Scopus. To trace Twitter users' reactions to the latest publications, we narrowed the search results down to English-language journal articles published in May 2020. To examine the characteristics of articles, articles without source title information were omitted. DOIs of articles were used to extract Twitter mentions from Altmetric.com. As our ultimate plan was to analyze the complete Twitter social networks at the article level, articles with fewer than 10 or over 100,000 Twitter mentions were excluded. We cross-checked Scopus, Altmetric.com, and the Crossref API to retrieve the date when each article was first made available. In total, 417 articles from a variety of research areas (health sciences: 69.96%, life sciences: 17.08%, social sciences & humanities: 5.14%, physical sciences: 4.94%, multidisciplinary: 2.88%) were retrieved. According to the statistics provided by Altmetric.com, these articles had been mentioned in 153,098 tweets by 100,620 unique users as of the date of data collection, June 22, 2020.

Using the Twitter API, we further collected information about the retrieved tweets and their Twitter users. As some tweets and user accounts were no longer active, 89,258 user profiles and 139,298 unique tweets with available user profiles were retrieved from the Twitter API; 131 tweets mentioned multiple articles. The analysis in this paper is based on this matched set of data.

3.2 Identifying Explicit Bot Accounts

Twitter bots are software designed to autonomously perform Twitter activities, such as tweeting, retweeting, following, and replying via the Twitter API without human judgment and selection [13, 14]. In this preliminary study, only explicit bots, including 1) self-identified bots and 2) possible spambots, were covered.

This section presents the method used to distinguish bot accounts in the collected data. First, the method introduced in Haustein's study [8] was adopted to identify self-identified bots by searching for a pre-defined set of keywords in users' user names, handles, and bios on Twitter. We made minor changes to the original query string to apply stricter criteria (see below).

(“bot” | “robot” | “tweetbot” | “tweet bot” | “twitterbot” | “tweeter bot” | “a *robot”) & NOT (“bot hate” | “bot sniper” | “block *bot” | “not a *bot” | “nor a *bot” | “neither a *bot” | “like a *bot” | “sometimes*bot” | “think i am a bot” | “roboti*”) | (“automat*” & (“alert” | “update” | “feed” | “link” | “news” | “stream” | “script” | “tweet”) & NOT (“no* auto”)) | ((“article” | “literature” | “paper” | “peer-review*” | “preprint” | “publications” | “pubmed” | “arxiv” | “biorxiv” | “medrxiv”) & (“alert” | “update” | “feed” | “links” | “stream”) & NOT (“editor” | “journalist” | “official” | “feeding” | “links between”)) | ((“created by” | “developed by” | “programmed”) & (“share” | “daily” | “latest” | “news” | “podcast”)) | (“aggregator” | “news feed” | “datafeed” | “new submissions” | “latest publications” | “new publication” | “daily updates”)
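As an illustration, a keyword filter of this kind can be sketched in Python with a small, simplified subset of the patterns above; the pattern lists and function name below are our own illustrative choices, not the study's actual code.

```python
import re

# Illustrative subset of the keyword rules; the full query string used in
# the study is much larger. These patterns are assumptions for the sketch.
INCLUDE = [r"\bbot\b", r"\btweetbot\b", r"automat\w*"]
EXCLUDE = [r"not a \w*bot", r"bot hate", r"bot sniper"]

def is_self_identified_bot(profile_text: str) -> bool:
    """Flag a profile (user name + handle + bio) as a self-identified bot."""
    text = profile_text.lower()
    # Exclusion patterns veto a match (e.g. "not a robot" in a human bio).
    if any(re.search(p, text) for p in EXCLUDE):
        return False
    return any(re.search(p, text) for p in INCLUDE)
```

In practice the filter would be run over the concatenated user name, handle, and bio of each of the roughly 89,000 collected profiles.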

Second, extracting the value of the "source" field of tweet objects from the Twitter API, i.e., the device or application from which a tweet was posted, we identified accounts that employ bot clients using the query string below. We did not consider third-party social media marketing or management applications, such as TweetDeck, IFTTT, and Hootsuite, as bot clients, as they may involve human selection of content.

(“bot” & NOT “tweetbot for *” & NOT “roboti*”) | (“paper” & NOT (“paper.li” | “instapaper”)) | “retweet” | “update” | “alert” | “auto” | “curat*” | “aggregat*” | “combinator” | “feed” | “arxiv” | “biorxiv” | “medrxiv” | “journal” | “article” | “preprint” | “RT”

Next, we extracted a list of potential spambots by identifying accounts that posted the same content more than three times.
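The repeated-content heuristic can be sketched as follows; the function name, threshold parameter, and data shape are our own assumptions, not the authors' implementation.

```python
from collections import Counter

def potential_spambots(tweets, threshold=3):
    """Return users that posted identical tweet text more than `threshold` times.

    `tweets` is an iterable of (user_id, tweet_text) pairs. This is a sketch
    of the repeated-content heuristic, not the study's exact code.
    """
    counts = Counter(tweets)  # counts occurrences of each (user, text) pair
    return {user for (user, _text), n in counts.items() if n > threshold}
```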

Lastly, we manually checked the profiles of the identified bot accounts and removed those that were clearly misclassified, e.g., where "bot" was part of the user's real name, or where the profile contained word combinations not covered by the query string such as "bot killer", "blocked by bot", "not a creeper or a robot", etc.

As a result, 400 explicit bot accounts were identified, accounting for 0.45% of users in our dataset. It is worth highlighting that these accounts contributed around 2.9% of all selected tweets citing COVID-19 publications on Twitter. To facilitate the comparison of patterns between explicit bot accounts (bots) and non-explicit-bot accounts (non-bots), we randomly selected 400 accounts from the unclassified users. Similarly, we read through their user descriptions to ensure that no explicit bot was included.

A major limitation of our method is that we were not able to measure the performance of the classification, and the proportion of bot accounts was likely underestimated. On the one hand, only explicit bots were covered, while less explicit or more intelligent bots remain unclassified. On the other hand, as the set of predefined keywords is not exhaustive, the bot identification strategy above may not be ideal in terms of recall. It is also possible that cyborgs (bot-assisted humans or human-assisted bots) fall under the category of bot accounts if they presented a high level of automation in the sample data. However, with basic manual validation as a measure to enhance precision, our study can still serve its purpose as a preliminary study capturing patterns of bot accounts in Twitter scholarly communication.

3.3 Data Analysis

The data analysis can be divided into two parts. First, the social and tweeting patterns of bots were studied in contrast to non-bots in our sample. Second, we characterized the articles tweeted by both groups. Statistical tests, including Mann-Whitney U tests and chi-squared tests, were performed to compare differences between bots and non-bots using SciPy in Python.
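For example, a Mann-Whitney U test comparing a per-account feature between the two groups can be run with SciPy as follows; the feature values below are invented for illustration and are not the study's data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-account values of one feature (e.g. number of statuses)
# for five bots and five non-bots; the numbers are made up.
bots = np.array([18000, 25000, 12000, 40000, 30000])
non_bots = np.array([8000, 5000, 9000, 7000, 6500])

# Non-parametric test, appropriate because such features are typically not
# normally distributed; a small p-value suggests the distributions differ.
u_stat, p_value = mannwhitneyu(bots, non_bots, alternative="two-sided")
```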

As suggested in existing scholarship, the social patterns of a Twitter user can be generalized from information such as the number of followers and the number of friends, whereas tweeting patterns can be examined from aspects such as sources of content, mediums of action, content patterns (e.g., the use of hashtags and user mentions in tweets), and timing (e.g., the frequency of tweets or retweets) [15, 16, 17, 18]. Drawing upon these studies, we compiled a list of features commonly used in Twitter bot detection studies. Table 1 presents the features tested in this study.
Table 1. Selected features for bots and non-bots comparison

User features: Account age (days), Length of user bio (with URLs removed), Number of statuses, Favorites-statuses ratio, Followers-friends ratio, Number of listed

Tweet features: Number of tweets, Average number of tweets per article, Average responding time to publications, Average number of tweets per day, Retweet-tweet ratio, Number of hashtags per tweet, Number of @mentions per tweet

To better understand bot accounts' tweeting patterns when mentioning scientific publications on Twitter, we also characterized the articles tweeted by bots and non-bots. The characteristics we analyzed include the open access status of the article, the impact of the source title, and the article's subject areas.

4 Results

As our sample data are not normally distributed, Mann-Whitney U tests were performed. Figure 1 shows the comparison between bots and non-bots regarding their user features. Consistent with existing studies, bots generated a higher volume of tweet statuses (Mdn = 18,127) than non-bot accounts (Mdn = 7,996.5), U = 64,550, p < .01. Our results also indicated a difference in favorites-statuses ratios between bots (Mdn = .38) and non-bots (Mdn = .91), U = 55,691, p < .01 [18]. It is noteworthy that, without considering the popularity of the account, bots in our sample are younger (Mdn = 1,785.5 days) than non-bots (Mdn = 2,378 days), U = 60,770, p < .01. 13.25% of the selected bot accounts were created in 2020, of which 18 accounts were dedicated to aggregating or sharing updates and related publications about COVID-19 according to their user profiles. Examples include "a bot sharing info from the CDC about #COVID19", "automatically post papers about the coronavirus…", "a bot tweeting people's #mentalhealth during #COVID", etc. In contrast, only 4.75% of non-bots were created in 2020. It is also interesting to observe that bots have a higher ratio of followers to friends (Mdn = .85) than non-bots (Mdn = .69), U = 74,055, p < .05. This suggests that bots may be influential in Twitter scholarly communication, serving the function of information dissemination.
Fig. 1.

User features: bots vs. non-bots

As shown in Fig. 2, it is evident that bots tweeted academic articles more often than non-bots, consistent with observations in existing scholarship [8]. First, bots generated a larger volume of tweets mentioning COVID-19 publications (Mdn = 4) than non-bots (Mdn = 1), U = 35,594.5, p < .01. The number of tweets generated by a sampled bot account can be as high as 193, while the maximum among non-bots was 21. Moreover, bots tended to mention the same article more often (Mdn = 1.33) than non-bots (Mdn = 1.00), U = 42,016.5, p < .01. Users' average responding time to newly published articles reflected the inaccuracy of articles' dates of availability, as negative values were observed. To tackle this, for each article, we assigned a dense rank to each user based on the time when they first reacted to the article and compared users' average rankings. As a result, we found that non-bots responded to articles faster than bots on average, as they were ranked higher (Mdn = 3) than bots (Mdn = 4), U = 64,980.5, p < .01. One possible reason is that bots run on predefined schedules, and not all of them were designed to catch newly published articles promptly.
Fig. 2.

Tweets mentioning COVID-19 publications
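The dense-ranking workaround described above can be sketched with pandas; the article and user IDs, column names, and timestamps below are invented for illustration and do not reproduce the study's data.

```python
import pandas as pd

# Toy example: within each article, users are ranked by the time of their
# first tweet, sidestepping unreliable publication dates.
tweets = pd.DataFrame({
    "article": ["a1", "a1", "a1", "a2", "a2"],
    "user":    ["u1", "u2", "u3", "u1", "u3"],
    "time":    pd.to_datetime([
        "2020-05-01 10:00", "2020-05-01 10:00",
        "2020-05-02 08:00", "2020-05-03 09:00", "2020-05-03 12:00",
    ]),
})

# Each user's first reaction to each article, then a dense rank per article
# (ties share a rank, as with u1 and u2 on article a1).
first = tweets.groupby(["article", "user"])["time"].min().reset_index()
first["rank"] = first.groupby("article")["time"].rank(method="dense")

# Lower average rank across articles = faster responder.
avg_rank = first.groupby("user")["rank"].mean()
```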

Regarding the patterns of tweets shown in Fig. 3, bot accounts seemed to use more hashtags (Mdn = 0) than non-bots (Mdn = 0), U = 69,010, p < .01. Similarly, more @mentions were added per tweet among bots (Mdn = 1) than non-bots (Mdn = 1), U = 72,056, p < .01. This corresponds to the finding in existing studies that spam tweets may use a slightly higher number of mentions and hashtags [19]. The top 10 hashtags most commonly used in bot-generated tweets include #COVID19, #SARSCoV2, #coronavirus, #covid19, #hydroxychloroquine, #COVID-19, #chloroquine, #COVID, #Covid19, and #Covid_19. Similarly, the top 10 hashtags used among non-bot accounts were also variants of COVID-19 and related medications, including #COVID19, #SARSCoV2, #covid19, #coronavirus, #Ritonavir, #Covid19, #Lopinavir, #Kaletra, #Hospitalized, and #Coronavirus.
Fig. 3.

Tweet features (1): bots vs. non-bots

As shown in Fig. 4, when comparing tweeting patterns among the selected accounts that posted more than one tweet in our sample (Nnon-bots = 89, Nbots = 271), significant differences were observed in the average number of tweets per day and the retweet-tweet ratio. A bot is likely to generate more tweets (Mdn = 1.75) than a non-bot (Mdn = 1.20) on a single day, U = 2,610.5, p < .01. Surprisingly, bot accounts have a lower retweet-tweet ratio (Mdn = .15) than non-bots in our sample (Mdn = .75), U = 2,500.5, p < .01. This differs from generic Twitter bots, which tend to retweet aggressively [14].
Fig. 4.

Tweet features (2): bots vs. non-bots

Bots and non-bots may also have different preferences when tweeting academic articles. For instance, bots were more likely to mention open access (OA) articles than non-bots, with 96.6% and 95% of tweets mentioning OA articles respectively, χ2 = 4.92, p < .05. Though statistical significance was observed in the chi-squared test comparing the tweet distribution between bots and non-bots by SJR quartile, χ2 = 11.11, p < .05, both groups strongly preferred articles from high-impact journals. On average, over 92.5% of non-bots' tweets mentioned articles from Q1 journals, while this percentage among bots was also as high as 90.8%. Regarding the disciplines of the articles tweeted, as COVID-19 is considered a health crisis, both groups paid great attention to articles in the medical and health sciences, though non-bots showed a higher variety of interest, including articles interpreting COVID-19 from social science, public health, and multidisciplinary perspectives.
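Comparisons of categorical distributions like the OA split above rest on a chi-squared test of a contingency table; the SciPy sketch below uses invented counts purely for illustration and does not reproduce the study's reported statistics.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows are bots / non-bots, columns are
# counts of tweets mentioning OA vs. non-OA articles. Counts are invented.
table = [
    [966, 34],   # bots:     96.6% of 1,000 tweets mention OA articles
    [950, 50],   # non-bots: 95.0% of 1,000 tweets mention OA articles
]

chi2, p, dof, expected = chi2_contingency(table)
```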

Figure 5 presents the top 50 keywords in tweets generated by bots and non-bots. The text size reflects the ranking of word frequency within each group. It is easy to see that both bots and non-bots paid attention to scientific outputs related to COVID-19, represented by keywords such as "article", "study", "paper", etc. Both groups may have interpreted COVID-19 as a "pandemic". It was also of interest to them to monitor updates about the growth of "cases". Non-bots seemed to have a particular interest in related medications, e.g., "lopinavir" and "ritonavir", which may not be a top concern among bots. There may also exist political bots that mention scientific articles, as "trump" was frequently mentioned by bot accounts in our sample.
Fig. 5.

Top 50 keywords in tweets

5 Discussion and Conclusion

Consistent with existing studies, our findings indicate the prevalence of bot-generated tweets. In general, bots are younger and tweeted more, and more frequently, than non-bots. As observed, a considerable proportion of them were created to share news and articles about COVID-19. Though bots may not excessively employ hashtags and @mentions in terms of the rate of relevant tweets, they added slightly more hashtags and @mentions than non-bots. Second, bots mentioning scientific publications tweeted differently from both non-bots and bots in a generic context. For instance, bots in our sample retweeted less than non-bots and have a relatively higher followers-friends ratio. In addition, bots and non-bots do not seem to share common content selection criteria, though both prefer articles from high-impact journals.

To summarize, the potential of bots to affect the validity of Twitter metrics in assessing research should be recognized. Sharing features with spambots [19], bots in our sample were found to generate excessive Twitter activity mentioning academic works. Moreover, with a relatively higher followers-friends ratio and efforts in employing hashtags and @mentions, bots may have the power to affect the process of communication. Additionally, automated algorithms that select articles based on specific criteria, e.g., particular research disciplines of source titles or journal impact factors, may bias Twitter metrics. However, it is encouraging to observe original tweets, likely sourced from external platforms [9], among automated bots. In other words, bots may serve as idea starters and positively contribute to scholarly communication on the social web.

A major limitation of this study is that we did not have a manually labeled dataset, and the dataset used for analysis was relatively small. Moreover, without in-depth text analysis and more granular account classification, we cannot tell whether the implications of automated accounts are positive or negative. Also, as we focused on publications related to COVID-19, a trending topic on Twitter, our findings may not be applicable to publications on other topics. However, from this study, we do see the potential, as well as the necessity, of bot-related studies in the context of online scholarly communication. With data on a larger scale and more sophisticated bot detection techniques, we will be able to present more rigorous findings regarding the implications of automated bots. A comprehensive understanding of the role of bots in the process of scholarly communication will help us further assess Twitter metrics and identify possible solutions to enhance their validity as research impact indicators.

References

  1. Sugimoto, C.R., Work, S., Larivière, V., Haustein, S.: Scholarly use of social media and altmetrics: a review of the literature. J. Assoc. Inf. Sci. Technol. 68, 2037–2062 (2017). https://doi.org/10.1002/asi.23833
  2. Robinson-Garcia, N., van Leeuwen, T.N., Rafols, I.: Using altmetrics for contextualised mapping of societal impact: from hits to networks. Sci. Public Policy 45, 815–826 (2018). https://doi.org/10.1093/scipol/scy024
  3. Van Noorden, R.: Online collaboration: scientists and the social network. Nature 512, 126–129 (2014). https://doi.org/10.1038/512126a
  4. Hassan, S.-U., Imran, M., Gillani, U., Aljohani, N.R., Bowman, T.D., Didegah, F.: Measuring social media activity of scientific literature: an exhaustive comparison of scopus and novel altmetrics big data. Scientometrics 113(2), 1037–1057 (2017). https://doi.org/10.1007/s11192-017-2512-x
  5. Darling, E., Shiffman, D., Côté, I., Drew, J.: The role of Twitter in the life cycle of a scientific publication. Ideas Ecol. Evol. 6 (2013). https://doi.org/10.4033/iee.2013.6.6.f
  6. Robinson-Garcia, N., Costas, R., Isett, K., Melkers, J., Hicks, D.: The unbearable emptiness of tweeting—about journal articles. PLoS ONE 12, e0183551 (2017). https://doi.org/10.1371/journal.pone.0183551
  7. Robinson-Garcia, N., Arroyo-Machado, W., Torres-Salinas, D.: Mapping social media attention in Microbiology: identifying main topics and actors. FEMS Microbiol. Lett. 366 (2019). https://doi.org/10.1093/femsle/fnz075
  8. Haustein, S.: Scholarly Twitter metrics. In: Glänzel, W., Moed, H.F., Schmoch, U., Thelwall, M. (eds.) Handbook of Quantitative Science and Technology Research (2018). https://arxiv.org/abs/1806.02201
  9. Haustein, S., Bowman, T.D., Holmberg, K., Tsou, A., Sugimoto, C.R., Larivière, V.: Tweets as impact indicators: examining the implications of automated “bot” accounts on Twitter. J. Assoc. Inf. Sci. Technol. (2016). https://doi.org/10.1002/asi.23456
  10. Yu, H.: Context of altmetrics data matters: an investigation of count type and user category. Scientometrics 111, 267–283 (2017). https://doi.org/10.1007/s11192-017-2251-z
  11. Haustein, S., Toupin, R., Alperin, J.P.: “Not sure if scientist or just Twitter bot” Or: who tweets about scholarly papers (2018). https://www.altmetric.com/blog/not-sure-if-scientist-or-just-twitter-bot-or-who-tweets-about-scholarly-papers/
  12. Aljohani, N.R., Fayoumi, A., Hassan, S.-U.: Bot prediction on social networks of Twitter in altmetrics using deep graph convolutional networks. Soft Comput. 24(15), 11109–11120 (2020). https://doi.org/10.1007/s00500-020-04689-y
  13. Kousha, K., Thelwall, M.: COVID-19 publications: database coverage, citations, readers, tweets, news, Facebook walls, Reddit posts. Quant. Sci. Stud. 1–24 (2020). https://doi.org/10.1162/qss_a_00066
  14. Chu, Z., Gianvecchio, S., Wang, H., Jajodia, S.: Detecting automation of twitter accounts: are you a human, bot, or cyborg? IEEE Trans. Dependable Secur. Comput. 9, 811–824 (2012). https://doi.org/10.1109/TDSC.2012.75
  15. Kantepe, M., Ganiz, M.C.: Preprocessing framework for Twitter bot detection. In: 2017 International Conference on Computer Science and Engineering (UBMK), pp. 630–634. IEEE (2017). https://doi.org/10.1109/UBMK.2017.8093483
  16. Oentaryo, R.J., Murdopo, A., Prasetyo, P.K., Lim, E.-P.: On profiling bots in social media. In: Spiro, E., Ahn, Y.-Y. (eds.) SocInfo 2016. LNCS, vol. 10046, pp. 92–109. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47880-7_6
  17. Kudugunta, S., Ferrara, E.: Deep neural networks for bot detection. Inf. Sci. 467, 312–322 (2018). https://doi.org/10.1016/j.ins.2018.08.019
  18. Gilani, Z., Kochmar, E., Crowcroft, J.: Classification of Twitter accounts into automated agents and human users. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 489–496 (2017). https://doi.org/10.1145/3110025.3110091
  19. Sedhai, S., Sun, A.: HSpam14: a collection of 14 million tweets for hashtag-oriented spam research. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 223–232 (2015). https://doi.org/10.1145/2766462.2767701

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Wee Kim Wee School of Communication and Information, Nanyang Technological University, Singapore, Singapore
