In computational social science, two parallel research directions, one exploring news consumption patterns and the other linguistic regularities, have made significant inroads into understanding complex political polarization in the era of the ubiquitous internet. However, little or no literature presents a unified treatment combining both directions. When working on social events in countries where English is not the first language, the limited availability of computational linguistic resources is often a barrier to sophisticated linguistic analyses. In this work, we analyze an important sociopolitical event, the 2019 South American protests, and demonstrate that (1) a combined treatment offers a more comprehensive understanding of the event; and (2) these cross-cutting methods can be applied in a synergistic way. The insights gained by combining these methods include that polarization in users’ news-sharing patterns was consistent with their stances towards the government, and that polarization in their language mainly manifested along ideological, political, or protest-related lines. In addition, we release a massive dataset of 15 million tweets relevant to this crisis.
- Linguistic polarization
- News-sharing polarization
- South American protests
Publicly available at: https://doi.org/10.5281/zenodo.6213032.
This refers to the decree 883 which proposed austerity measures and started the protests in the country.
An outlet can be retweeted either directly or indirectly via a tweet originating from their account, or a third party tweet containing a url with their domain.
Note to reviewers: to uphold the anonymization policies for submission, we will make the link publicly available before publication.
A Data Collection
The dataset consists of 100 million tweets from 15+ million users, collected using Twitter’s API v1 around the protests that transpired in the countries studied. For each event, we built the queries by first identifying the most prominent hashtags and terms (using Twitter’s trending terms in the country). After some days of streaming, we determined the most frequent relevant hashtags not yet included, taking special care to include hashtags used by different groups (for and against each government). We added these to our query and collected them via weekly REST grabs (to ensure their collection from the start). By repeating this process each week, we built up a set of more than 500 hashtags. To improve the quality of the conversational structure present in the data, we also re-hydrated any missing targets or ancestors (up to 5 levels above in the conversation tree) of replies or quotes. Table 7 presents the relevant descriptive statistics for the collection. To better contextualize our work, we first present a brief overview of the main events that transpired in each of the countries.
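One iteration of the query expansion described above can be sketched as follows. This is a minimal illustration, not the collection pipeline itself: the function name `candidate_hashtags`, the regular expression, the `top_k` parameter, and the toy tweets and hashtags are all assumptions. The sketch only surfaces frequent hashtags that are not yet in the query; judging their relevance and stance coverage remains a separate step.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")

def candidate_hashtags(tweets, query_tags, top_k=10):
    """Return the top_k most frequent hashtags not already in the query."""
    known = {t.lower().lstrip("#") for t in query_tags}
    counts = Counter()
    for text in tweets:
        # Count each hashtag once per tweet so repeated tags do not dominate.
        tags = {m.lower() for m in HASHTAG_RE.findall(text)}
        counts.update(tags - known)
    return counts.most_common(top_k)

# Toy example with made-up tweets and hashtags.
streamed = [
    "Marcha hoy #ParoNacional #Quito",
    "#ParoNacional sigue #FueraGobierno",
    "#FueraGobierno #paronacional",
    "Apoyo total #ApoyoAlPresidente #FueraGobierno",
]
query = {"#ParoNacional"}
new_tags = candidate_hashtags(streamed, query, top_k=2)
```

Hashtags are lower-cased before counting so that case variants of the same tag are merged.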
A.1 Ethical Considerations
We make our data publicly available and, to adhere to Twitter’s terms and conditions for sharing data, we do not share the full JSON of the collected tweets (footnote 5). Instead, we provide their respective tweet or user IDs, the type of tweet (Original, Reply or Quote) and, in the case of weakly labeled users or tweets, their assigned label. Since the tweets will have to be re-hydrated, a tweet that a user deletes (or a deleted account) will not be available for analysis, which preserves the user’s right to be forgotten. However, for the hand-labeled political figures (described later), given their public role during these events, we provide not only their user ID but also their user name and user type. We also release the full set of labeled stance-tags.
B Weak-Labeled Dataset
We determine the user stance based on how prominently they tweet (or retweet) a hashtag from a given stance or retweet a labeled political figure. In this appendix, we provide further details of the validation methodology used to prune the set of stance-tags and of the weak labels obtained from each signal.
B.1 Stance-Tags Validation
Our weak-labeling methodology relies on the hypothesis that users are more likely to tweet (or retweet) hashtags or political figures that are aligned with their stances during these events. Hence, a weak stance label is assigned to a user if their percentage of tweets with a consistent stance-tag is above a given threshold. To test this hypothesis, we apply our methodology (based only on stance-tags) to predict the stance of the labeled political figures. We can also use this exercise to determine a suitable threshold for the stance assignment. We limit our analysis to the 88.1% of labeled users that tweeted (or retweeted) at least 5 tweets containing a stance-tag. We also present results excluding the extra set of 229 stance-tags obtained using this set of users, in order to better assess performance in the wild. Figure 2 presents the accuracy of the methodology at different probability thresholds. As expected, higher thresholds are more conservative in the assignment of a label (the percentage of undetermined users increases) but also decrease the likelihood of misclassification. However, even at the most aggressive classification threshold, only 2.6% of the users are misclassified, which supports our starting hypothesis.
For the construction of the dataset released in this work, we opt for a conservative 90% threshold, which results in 74.2% correctly classified users but a classification error of only 0.3% (2 users). The reason for this conservative approach is that our validation set is composed of highly political users, so misclassification could be more likely among more casual users.
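The threshold analysis can be illustrated with a small sketch. It assumes each validation user is summarized by the number of their tweets carrying pro-government and anti-government stance-tags plus a hand-assigned label; the function names, the "pro"/"anti" label strings, and the toy counts are hypothetical and are not the paper's data.

```python
def classify(pro, anti, threshold, min_tweets=5):
    """Weak-label a user from counts of tweets with pro/anti stance-tags."""
    total = pro + anti
    if total < min_tweets:
        return None  # too little activity to label
    # Thresholds above 0.5 guarantee that at most one branch can fire.
    if pro / total >= threshold:
        return "pro"
    if anti / total >= threshold:
        return "anti"
    return None  # undetermined

def sweep(users, thresholds):
    """users: list of (pro_count, anti_count, true_label) tuples."""
    results = {}
    for t in thresholds:
        preds = [(classify(p, a, t), y) for p, a, y in users]
        correct = sum(pred == y for pred, y in preds)
        undetermined = sum(pred is None for pred, _ in preds)
        results[t] = {
            "correct": correct,
            "wrong": len(preds) - correct - undetermined,
            "undetermined": undetermined,
        }
    return results

# Made-up validation users: (pro tweets, anti tweets, hand label).
validation = [
    (10, 0, "pro"), (19, 1, "pro"), (7, 3, "pro"),
    (1, 19, "anti"), (3, 7, "pro"), (0, 8, "anti"),
]
results = sweep(validation, [0.6, 0.9])
```

In this toy data the stricter threshold removes the single misclassification at the cost of two undetermined users, which is the trade-off the text describes.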
Nonetheless, we are able to considerably increase the performance of this methodology (with the 90% threshold) by first including the aforementioned 229 hashtags, used exclusively by users of each side, which improves the accuracy to 80.0%. Lastly, we prune our hashtag set by removing tags that were used too frequently by users of a different stance. This removes 46 hashtags and brings the final classification accuracy of our proposed weak-labeling methodology to 88.6%.
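The pruning step could look roughly like the sketch below. The text does not state how "too frequently" was defined, so the `max_cross_share` cutoff, the function name, and the toy tags and users are assumptions made purely for illustration.

```python
from collections import defaultdict

def prune_stance_tags(tag_stance, usage, user_stance, max_cross_share=0.1):
    """Drop tags whose share of uses by opposite-stance users exceeds the cutoff.

    tag_stance:  {tag: stance}
    usage:       iterable of (user_id, tag) pairs, one per use
    user_stance: {user_id: stance} for the validated users
    """
    total = defaultdict(int)
    cross = defaultdict(int)
    for user, tag in usage:
        if tag not in tag_stance or user not in user_stance:
            continue  # ignore unknown tags and unlabeled users
        total[tag] += 1
        if user_stance[user] != tag_stance[tag]:
            cross[tag] += 1
    # Tags never used by validated users are kept unchanged.
    return {
        tag: stance
        for tag, stance in tag_stance.items()
        if total[tag] == 0 or cross[tag] / total[tag] <= max_cross_share
    }

# Toy example with made-up tags and users.
tags = {"tag_a": "pro", "tag_b": "anti"}
stances = {"u1": "pro", "u2": "anti"}
uses = [("u1", "tag_a"), ("u1", "tag_a"),
        ("u2", "tag_b"), ("u1", "tag_b"), ("u2", "tag_b")]
kept = prune_stance_tags(tags, uses, stances)
```

Here `tag_b` is dropped because one of its three uses comes from a user of the opposite stance, which exceeds the assumed 10% cutoff.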
B.2 Assignment of User Stances
We assign the user stance based on how prominently they tweet (or retweet) a stance-tag or retweet a labeled political figure. The threshold used to determine the stance was obtained during the hashtag validation procedure described above and set at 90%.
Hashtag Usage. Users were assigned a stance if they used stance-tags either in their tweets (or retweets) or in their user description. In both cases, a stance was assigned to a tweet (or description) if it contained only hashtags with the same stance; otherwise it was deemed inconsistent. As before, we only proceeded with users that had at least 5 tweets with a consistent stance or at least one consistent description. As less than 1% of labeled users were labeled based on their descriptions, we do not disaggregate results by the origin of the label. A stance was assigned to a user if at least 90% of their tweets had the same stance. The number of users classified and their distribution are presented in Table 8.
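The tweet-level and user-level rules can be sketched together as below. This is an illustrative reading rather than the released implementation: it assumes the 90% share is computed over the user's tweets that carry a consistent stance, it ignores user descriptions, and the function names and toy tags are hypothetical.

```python
from collections import Counter

def tweet_stance(hashtags, tag_stance):
    """Return the tweet's stance, or None if it has no stance-tag or mixes stances."""
    stances = {tag_stance[h] for h in hashtags if h in tag_stance}
    return stances.pop() if len(stances) == 1 else None

def user_stance(tweets, tag_stance, min_tweets=5, threshold=0.9):
    """tweets: list of hashtag lists, one per tweet or retweet by the user."""
    counts = Counter(
        s for s in (tweet_stance(t, tag_stance) for t in tweets) if s is not None
    )
    total = sum(counts.values())
    if total < min_tweets:
        return None  # not enough consistent tweets
    stance, n = counts.most_common(1)[0]
    return stance if n / total >= threshold else None

# Toy example with made-up stance-tags.
tag_stance = {"tag_pro": "pro", "tag_anti": "anti"}
user_a = [["tag_pro"]] * 10 + [["tag_pro", "tag_anti"]]  # one mixed tweet is ignored
user_b = [["tag_pro"]] * 3 + [["tag_anti"]] * 3          # evenly split
user_c = [["tag_anti"]] * 4                              # below the 5-tweet minimum
labels = [user_stance(u, tag_stance) for u in (user_a, user_b, user_c)]
```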
Endorsement of Political Figures. The procedure for assigning a stance to a user based on their endorsement of political figures follows the same logic as before. Users were assigned a stance if at least 90% of their retweets of labeled political figures were of figures with the same stance. As before, we only proceeded with users that had at least 5 retweets of these figures. The number of users classified and their distribution are presented in Table 9.
C Filtering Irrelevant Media Tweets
We started with a dataset of news agencies and journalists for the countries explored (obtained from the NetMapper software; footnote 6). It had several limitations, so we expanded it by searching for the most important news agencies operating in each country, manually checking who they follow, and adding agencies that were not included. This resulted in a list of 853 news agencies (or major reporters) with their Twitter handles and main URL (if available). Notably, the list included agencies from Venezuela and Russia that predominantly operate in the region, which is important because we explore influence campaigns on the protests. We then identified the agencies that were either directly retweeted by a user or that had a user tweet or retweet a URL corresponding to their domain. The number of news agencies from each country resulting from this process is shown in Table 10.
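Matching a tweet to an agency by retweeted handle or by URL domain can be sketched as follows. The record layout (`name`, `handle`, `url` for agencies; `retweeted_handle`, `urls` for tweets) and the example agency are assumptions for illustration, and the sketch expects URLs that have already been expanded from any shortener.

```python
from urllib.parse import urlparse

def normalize_domain(url):
    """Lower-case host without a leading 'www.'; accepts bare domains too."""
    host = urlparse(url if "//" in url else "//" + url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def build_index(agencies):
    """agencies: list of dicts with 'name', 'handle' and optional 'url'."""
    by_handle = {a["handle"].lower(): a["name"] for a in agencies}
    by_domain = {
        normalize_domain(a["url"]): a["name"] for a in agencies if a.get("url")
    }
    return by_handle, by_domain

def match_agency(tweet, by_handle, by_domain):
    """Return the agency a tweet points to, by retweeted handle or shared URL."""
    rt = (tweet.get("retweeted_handle") or "").lower()
    if rt in by_handle:
        return by_handle[rt]
    for url in tweet.get("urls", []):
        host = normalize_domain(url)
        # Also accept subdomains such as 'm.example.com' for 'example.com'.
        for domain, name in by_domain.items():
            if host == domain or host.endswith("." + domain):
                return name
    return None

# Toy example with a made-up agency.
agencies = [{"name": "Example News", "handle": "ExampleNews",
             "url": "https://www.example.com"}]
by_handle, by_domain = build_index(agencies)
direct = match_agency({"retweeted_handle": "examplenews", "urls": []},
                      by_handle, by_domain)
via_url = match_agency({"retweeted_handle": "someone",
                        "urls": ["https://m.example.com/politica/nota-1"]},
                       by_handle, by_domain)
no_match = match_agency({"urls": ["https://notexample.com/x"]},
                        by_handle, by_domain)
```

Comparing on the host with a leading dot avoids false matches on domains that merely end with the same characters.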
However, the news articles identified in our dataset cover topics ranging from the protests to sports. When studying the polarization of news consumption during a political event, it is important to first remove tweets that are irrelevant to the protests. It is not obvious whether a tweet from a news agency is relevant, but many tweets in our dataset contain the URL of the article they reference. For this reason, we determined the relevance to the protests of a small set of the 900 most tweeted URLs in our dataset, distributed among the different countries. We complemented this dataset with an additional set of URLs labeled by extracting subsection metadata from them. If the subsection referenced sports, culture or technology, the URL was labeled as irrelevant to the protests. Then, we assigned the URL label to any tweet that used it. The final sample distribution is presented in Table 11. We note that even though we are able to assign a label to more than 100k tweets, most of them contained duplicated text (as news media tend to tweet the same thing multiple times). The classification was done with the deduplicated dataset.
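The labeling and deduplication steps can be sketched as below. The text does not specify how subsection metadata was extracted, so using the first URL path segment as the subsection is an assumption, as are the Spanish section names, the function names, and the toy tweets.

```python
import re
from urllib.parse import urlparse

# Assumed section names; the source only lists sports, culture and technology.
IRRELEVANT_SECTIONS = {"sports", "deportes", "culture", "cultura",
                       "technology", "tecnologia"}

def label_from_section(url):
    """Return 'irrelevant' if the first path segment is an off-topic section."""
    segments = [s for s in urlparse(url).path.lower().split("/") if s]
    if segments and segments[0] in IRRELEVANT_SECTIONS:
        return "irrelevant"
    return None  # the section alone does not settle relevance

def normalize_text(text):
    """Strip URLs and collapse whitespace so repeated posts compare equal."""
    text = re.sub(r"https?://\S+", "", text.lower())
    return " ".join(text.split())

def labeled_unique_tweets(tweets, url_labels):
    """tweets: list of (text, url); url_labels: {url: label} from hand labeling."""
    seen, out = set(), []
    for text, url in tweets:
        label = url_labels.get(url) or label_from_section(url)
        key = normalize_text(text)
        if label is None or key in seen:
            continue  # skip unlabeled tweets and repeated text
        seen.add(key)
        out.append((key, label))
    return out

# Toy example with made-up URLs and tweets.
hand_labels = {"https://example.com/politica/paro": "relevant"}
tweets = [
    ("Paro nacional: día 5 https://example.com/politica/paro",
     "https://example.com/politica/paro"),
    ("Paro  nacional: día 5", "https://example.com/politica/paro"),
    ("Final del campeonato", "https://example.com/deportes/final"),
    ("Sin etiqueta", "https://example.com/economia/nota"),
]
sample = labeled_unique_tweets(tweets, hand_labels)
```

The second tweet is dropped as a duplicate of the first, and the last one is dropped because its URL has neither a hand label nor an off-topic section.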
To classify the relevance of the tweets, we built a CNN text classifier using 300-dimensional FastText embeddings trained on the combined datasets (both by stance and country) used to analyze the language polarization. We used 100 filters on 3 layers with filter sizes 3, 4 and 5 and a dropout rate of 50%. We achieved an accuracy and F1-score of 92% on a held-out test set. After predicting the labels of tweets (relevant or irrelevant to the protests), we obtain a dataset of 1,024,166 relevant and 675,496 irrelevant tweets. The distribution of the dataset is shown in Table 11. The analysis of polarization in news consumption patterns presented in this work was done only on the tweets that are relevant to the protests.
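For reference, the forward pass of a classifier of this kind can be written out as below. This is a sketch under assumptions rather than the exact architecture used: the phrase "100 filters on 3 layers with filter sizes 3, 4 and 5" is read here as three parallel convolutional branches with 100 filters each, followed by max-over-time pooling, as is common in CNN sentence classifiers. The ReLU nonlinearity, the symbols, and the placement of dropout before the output layer are likewise assumptions.

```latex
% x_1, ..., x_n in R^{300}: FastText embeddings of the n tokens of a tweet
% h in {3, 4, 5}: filter width;  j = 1, ..., 100: filter index within a branch
% x_{i:i+h-1}: concatenation of h consecutive embeddings, so w^{(h,j)} in R^{300h}
\begin{align}
c^{(h,j)}_i &= \mathrm{ReLU}\!\left(\mathbf{w}^{(h,j)} \cdot \mathbf{x}_{i:i+h-1} + b^{(h,j)}\right), & i &= 1,\dots,n-h+1 \\
\hat{c}^{(h,j)} &= \max_{i}\, c^{(h,j)}_i \\
\mathbf{z} &= \big[\hat{c}^{(3,1)},\dots,\hat{c}^{(3,100)},\hat{c}^{(4,1)},\dots,\hat{c}^{(5,100)}\big] \in \mathbb{R}^{300} \\
\hat{\mathbf{y}} &= \mathrm{softmax}\!\left(\mathbf{W}(\mathbf{z}\odot\mathbf{r}) + \mathbf{b}\right), & r_k &\sim \mathrm{Bernoulli}(0.5)
\end{align}
```

Under this reading, the pooled vector has 3 × 100 = 300 entries, the output has two classes (relevant and irrelevant), and the dropout mask is applied during training only.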
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Villa-Cox, R., Zeng, H.S., KhudaBukhsh, A.R., Carley, K.M. (2022). Linguistic and News-Sharing Polarization During the 2019 South American Protests. In: Hopfgartner, F., Jaidka, K., Mayr, P., Jose, J., Breitsohl, J. (eds) Social Informatics. SocInfo 2022. Lecture Notes in Computer Science, vol 13618. Springer, Cham. https://doi.org/10.1007/978-3-031-19097-1_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19096-4
Online ISBN: 978-3-031-19097-1