
Linguistic and News-Sharing Polarization During the 2019 South American Protests


Part of the Lecture Notes in Computer Science book series (LNCS, volume 13618)


In computational social science, two parallel research directions – exploring news consumption patterns and linguistic regularities – have made significant inroads into better understanding political polarization in the era of ubiquitous internet access. However, little or no literature exists that presents a unified treatment combining these two research directions. When working on social events in countries where English is not the first language, the availability of computational linguistic resources is often a barrier to sophisticated linguistic analyses. In this work, we analyze an important sociopolitical event, the 2019 South American protests, and demonstrate that (1) a combined treatment offers a more comprehensive understanding of the event; and (2) these cross-cutting methods can be applied synergistically. The insights gained from combining these methods include that polarization in users’ news-sharing patterns was consistent with their stances towards the government, and that polarization in their language manifested mainly along ideological, political, or protest-related lines. In addition, we release a massive dataset of 15 million tweets relevant to this crisis.


  • Linguistic polarization
  • News-sharing polarization
  • South American protests

DOI: 10.1007/978-3-031-19097-1_5
Chapter length: 20 pages
Fig. 1.


Footnotes

1.

2. Publicly available at:

3. This refers to decree 883, which proposed austerity measures and started the protests in the country.

4. An outlet can be retweeted either directly, via a tweet originating from its account, or indirectly, via a third-party tweet containing a URL with its domain.

5. Note to reviewers: to uphold the anonymization policies for submission, we will make the link publicly available before publication.

6.
References

1. Alhazmi, K., Alsumari, W., Seppo, I., Podkuiko, L., Simon, M.: Effects of annotation quality on model performance. In: 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 63–67 (2021)
2. Babcock, M., Cox, R.V.C., Kumar, S.: Diffusion of pro- and anti-false information tweets: the Black Panther movie case. Comput. Math. Organ. Theory 25(1), 72–84 (2019)
3. Babcock, M., Villa-Cox, R., Carley, K.M.: Pretending positive, pushing false: comparing Captain Marvel misinformation campaigns. In: Shu, K., Wang, S., Lee, D., Liu, H. (eds.) Disinformation, Misinformation, and Fake News in Social Media. LNSN, pp. 83–94. Springer, Cham (2020)
4. Baldwin, M., Lammers, J.: Past-focused environmental comparisons promote proenvironmental outcomes for conservatives. Proc. Natl. Acad. Sci. 113(52), 14953–14957 (2016)
5. Barberá, P., et al.: The critical periphery in the growth of social protests. PLoS ONE 10(11), e0143611 (2015)
6. Beguerisse-Díaz, M., Garduno-Hernández, G., Vangelov, B., Yaliraki, S.N., Barahona, M.: Interest communities and flow roles in directed networks: the Twitter network of the UK riots. J. R. Soc. Interface 11(101), 20140940 (2014)
7. Darwish, K.: Quantifying polarization on Twitter: the Kavanaugh nomination. arXiv abs/2001.02125 (2020)
8. Del Vicario, M., et al.: The spreading of misinformation online. Proc. Natl. Acad. Sci. 113(3), 554–559 (2016)
9. Demszky, D., et al.: Analyzing polarization in social media: method and application to tweets on 21 mass shootings. In: NAACL-HLT 2019, pp. 2970–3005. Association for Computational Linguistics (2019)
10. Evans, A.: Stance and identity in Twitter hashtags. Lang. Internet 13(1) (2016)
11. Fisher, D.R., Waggle, J., Leifeld, P.: Where does political polarization come from? Locating polarization within the US climate change debate. Am. Behav. Sci. 57(1), 70–92 (2013)
12. Garrett, R.K.: The “echo chamber” distraction: disinformation campaigns are the problem, not audience fragmentation. J. Appl. Res. Mem. Cogn. 6(4), 370–376 (2017)
13. Golbeck, J., Hansen, D.: Computing political preference among Twitter followers. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1105–1108 (2011)
14. González-Bailón, S., Wang, N.: Networked discontent: the anatomy of protest campaigns in social media. Soc. Netw. 44, 95–104 (2016)
15. Gu, Y., Chen, T., Sun, Y., Wang, B.: Ideology detection for Twitter users via link analysis. In: Lee, D., Lin, Y.-R., Osgood, N., Thomson, R. (eds.) SBP-BRiMS 2017. LNCS, vol. 10354, pp. 262–268. Springer, Cham (2017)
16. Gurganus, J.: Russia: Playing a Geopolitical Game in Latin America. Carnegie Endowment for International Peace (2018)
17. Hovy, D., Spruit, S.L.: The social impact of natural language processing. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 591–598 (2016)
18. KhudaBukhsh, A.R., Sarkar, R., Kamlet, M.S., Mitchell, T.M.: We don’t speak the same language: interpreting polarization through machine translation. In: AAAI 2021, pp. 14893–14901 (2021)
19. KhudaBukhsh, A.R., Sarkar, R., Kamlet, M.S., Mitchell, T.M.: Fringe news networks: dynamics of US news viewership following the 2020 presidential election. In: WebSci 2022: 14th ACM Web Science Conference 2022, pp. 269–278. ACM (2022)
20. Kim, Y.: Convolutional neural networks for sentence classification. In: EMNLP 2014, pp. 1746–1751 (2014)
21. Koutra, D., Bennett, P.N., Horvitz, E.: Events and controversies: influences of a shocking news event on information seeking. CoRR abs/1405.1486 (2014)
22. Ling, R.: Confirmation bias in the era of mobile news consumption: the social and psychological dimensions. Digit. Journal. 8, 1–9 (2020)
23. McConnell, C., Margalit, Y., Malhotra, N., Levendusky, M.: Research: Political Polarization Is Changing How Americans Work and Shop. Harvard Business Review (2017)
24. Mohammad, S., Kiritchenko, S., Sobhani, P., Zhu, X., Cherry, C.: SemEval-2016 task 6: detecting stance in tweets. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 31–41 (2016)
25. Olteanu, A., Castillo, C., Diaz, F., Kıcıman, E.: Social data: biases, methodological pitfalls, and ethical boundaries. Front. Big Data 2, 13 (2019)
26. Poole, K.T., Rosenthal, H.: The polarization of American politics. J. Polit. 46(4), 1061–1079 (1984)
27. Poole, K.T., Rosenthal, H.: The polarization of American politics. J. Polit. 46(4), 1061–1079 (1984)
28. Prior, M.: Media and political polarization. Annu. Rev. Polit. Sci. 16, 101–127 (2013)
29. Rouvinski, V.: Understanding Russian priorities in Latin America. Kennan Cable 20 (2017)
30. Smith, S.L., Turban, D.H.P., Hamblin, S., Hammerla, N.Y.: Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In: 5th International Conference on Learning Representations, ICLR 2017 (2017)
31. Spohr, D.: Fake news and ideological polarization: filter bubbles and selective exposure on social media. Bus. Inf. Rev. 34(3), 150–160 (2017)
32. Swamy, S., Ritter, A., de Marneffe, M.C.: “i have a feeling trump will win..................”: forecasting winners and losers from user predictions on Twitter. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1583–1592 (2017)
33. Tsakalidis, A., Aletras, N., Cristea, A.I., Liakata, M.: Nowcasting the stance of social media users in a sudden vote: the case of the Greek referendum. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 367–376 (2018)
34. Wong, F.M.F., Tan, C.W., Sen, S., Chiang, M.: Quantifying political leaning from tweets, retweets, and retweeters. IEEE Trans. Knowl. Data Eng. 28(8), 2158–2172 (2016)
35. Xiao, Z., Song, W., Xu, H., Ren, Z., Sun, Y.: TIMME: Twitter ideology-detection via multi-task multi-relational embedding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2258–2268 (2020)


Author information

Corresponding author

Correspondence to Ramon Villa-Cox.

A Data Collection

Table 7. Collection period and number of tweets collected for each country.

The dataset consists of 100 million tweets from more than 15 million users, collected with Twitter’s API v1 around the protests that transpired in the countries studied. For each event, we built the queries by first identifying the most prominent hashtags/terms (using Twitter’s trending terms in the country). After some days of streaming, we determined the most frequent relevant hashtags not yet included, taking special care to include hashtags used by different groups (for and against each government). We added these to our query and also collected them via weekly REST grabs (to ensure their collection from the start). By repeating this process each week, we built up a set of more than 500 hashtags. To improve the quality of the conversational structure present in the data, we also re-hydrated any missing targets or ancestors (up to 5 levels above in the conversation tree) of replies or quotes. Table 7 presents the relevant descriptive statistics for the collection. To better contextualize our work, we first present a brief overview of the main events that transpired in each of the countries.
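The weekly query-expansion loop described above can be sketched as follows. This is a minimal illustration, not the collection pipeline itself: the function name `expand_query`, the tweet dictionary layout, and the `top_k` cut-off are all assumptions, and in practice the newly surfaced hashtags were vetted by hand for relevance and for balance between pro- and anti-government groups.

```python
from collections import Counter

def expand_query(current_query, weekly_tweets, top_k=10):
    """One round of the weekly expansion loop: count the hashtags
    seen in this week's stream and add the most frequent ones that
    are not yet part of the tracking query."""
    counts = Counter(
        tag.lower()
        for tweet in weekly_tweets
        for tag in tweet.get("hashtags", [])
    )
    new_tags = [tag for tag, _ in counts.most_common()
                if tag not in current_query]
    # The selected hashtags join next week's streaming query and are
    # also back-filled via REST grabs to recover their earlier tweets.
    return current_query | set(new_tags[:top_k])
```

Repeating this loop weekly is what grows the query from the initial trending terms to the final set of more than 500 hashtags.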


A.1 Ethical Considerations

We make our data publicly available and, to adhere to Twitter’s terms and conditions for sharing data, we do not share the full JSON of the collected tweets (see footnote 5). Instead, we provide their respective tweet or user IDs, the type of tweet (Original, Reply, or Quote), and, in the case of weakly labeled users or tweets, their assigned label. Since the tweets will have to be re-hydrated, any tweet (or account) a user deletes will not be available for analysis, ensuring that the user’s right to be forgotten is preserved. However, for the hand-labeled political figures (described later), given their public role during these events, we provide not only their user ID but also their user name and user type. We also release the full set of labeled stance-tags.

B Weak-Labeled Dataset

We determine a user’s stance based on how prominently they tweet (or retweet) a hashtag associated with a given stance, or retweet a labeled political figure. In this appendix, we detail the validation methodology used to prune the set of stance-tags and describe the weak labels obtained from each signal.

Fig. 2. Performance of the weak labeling methodology on the labeled political figures at different probability thresholds. The chosen threshold for the construction of the dataset is indicated by the dashed vertical line.

B.1 Stance-Tags Validation

Our weak labeling methodology relies on the hypothesis that users are more likely to tweet (or retweet) hashtags or political figures that are aligned with their stances during these events. Hence, a weak stance label is assigned to a user if their percentage of tweets with a consistent stance-tag is above a given threshold. To test this hypothesis, we apply our methodology (based solely on stance-tags) to predict the stance of the labeled political figures. We can also use this exercise to determine a suitable threshold for the stance assignment. We limit our analysis to the 88.1% of labeled users that tweeted (or retweeted) at least 5 tweets containing a stance-tag. We also present results excluding the set of 229 extra stance-tags obtained using this set of users, in order to better assess performance in the wild. Figure 2 presents the accuracy of the methodology at different probability thresholds. As expected, higher thresholds are more conservative in the assignment of a label (the percentage of undetermined users increases) but also decrease the likelihood of misclassification. However, even at the most aggressive classification threshold, only 2.6% of the users are misclassified, which supports our starting hypothesis.

For the construction of the dataset released in this work, we opt for a conservative 90% threshold, which results in 74.2% correctly classified users but only a 0.3% (2 users) classification error. The reason for this conservative approach is that our validation set comprises highly political users, which could result in a higher likelihood of misclassification among more casual users [10].

Nonetheless, we are able to considerably increase the performance of this methodology (with the 90% threshold) by first including the aforementioned 229 hashtags, used exclusively by users of each side, which improves the accuracy to 80.0%. Lastly, we prune our hashtag set by removing tags that were used too frequently by users of a different stance. This removes 46 hashtags and brings the final classification accuracy of our proposed weak-labeling methodology to 88.6%.
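The threshold rule and the final pruning step can be sketched as follows. This is a schematic reading of the procedure, not the released code: the function names are ours, and the `max_cross` cut-off in `prune_tags` is a hypothetical value, since the text reports only that tags used too frequently by the opposite stance were removed.

```python
from collections import Counter

def weak_label(tag_stances, threshold=0.9, min_tweets=5):
    """Weakly label a user from the stances of the stance-tags in
    their tweets: require at least `min_tweets` tagged tweets, and
    assign a stance only if its share reaches the threshold (90% in
    the released dataset)."""
    if len(tag_stances) < min_tweets:
        return None  # not enough signal; user stays unlabeled
    stance, count = Counter(tag_stances).most_common(1)[0]
    return stance if count / len(tag_stances) >= threshold else None

def prune_tags(tag_usage, max_cross=0.1):
    """Drop hashtags used too often across stance lines.
    `tag_usage` maps tag -> (uses by its assigned stance, uses by the
    other stance); `max_cross` is an assumed cut-off."""
    return {tag for tag, (same, other) in tag_usage.items()
            if other / (same + other) <= max_cross}
```

The same `weak_label` logic applies to both signals in Appendix B.2: the input stances come either from stance-tags in a user's tweets or from the labeled political figures they retweet.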

B.2 Assignment of User Stances

We assign a user’s stance based on how prominently they tweet (or retweet) a stance-tag or retweet a labeled political figure. The threshold used to determine the stance was obtained during the hashtag validation procedure described above and set at 90%.

Table 8. Government Stance of users based on hashtag usage.

Hashtag Usage. Users were assigned a stance if they used stance-tags either in their tweets (or retweets) or in their user description. In both cases, a stance was assigned to a tweet (or description) if the hashtags it contains share the same stance; otherwise it was deemed inconsistent. As before, we only proceeded with users that had at least 5 tweets with a consistent stance, or at least one consistent description. As fewer than 1% of labeled users were labeled based on their descriptions, we do not disaggregate results by the origin of the label. A stance was assigned to a user if at least 90% of their tweets had the same stance. The number of users classified and their distribution is presented in Table 8.

Endorsement of Political Figures. The procedure for assigning a stance to a user based on their endorsement of political figures follows the same logic as before. Users were assigned a stance if at least 90% of their retweets of labeled political figures were of users with the same stance. As before, we only proceeded with users that had at least 5 such retweets. The number of users classified and their distribution is presented in Table 9.

Table 9. Government stance of users based on endorsement of political figures.
Table 10. Number of news agencies in each country. *The regional category includes regional Venezuelan and Russian media among others.
Table 11. Distribution of the labeled tweets and resulting predictions after classification.

C Filtering Irrelevant Media Tweets

We started with a dataset of news agencies and journalists for the countries explored, obtained from the NetMapper software (see footnote 6). As it had several limitations, we expanded it by searching for the most important news agencies operating in each country, manually checking whom they follow, and adding agencies that were not included. This resulted in a list of 853 news agencies (or major reporters) detailing their Twitter handles and main URL (if available). Notably, the list includes agencies from Venezuela and Russia that predominantly operate in the region; this is important as we explore influence campaigns on the protests. We then identified the agencies that were either directly retweeted by a user or whose domain appeared in a URL tweeted/retweeted by a user. The number of news agencies from each country resulting from this process is shown in Table 10.
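The two attribution routes from footnote 4 (a direct retweet of the outlet’s account, or a tweet carrying a URL on the outlet’s domain) can be sketched as follows. The `OUTLETS` mapping and the tweet field names are hypothetical stand-ins for the 853-entry list, not the actual data format.

```python
from urllib.parse import urlsplit

# Hypothetical excerpt of the outlet list: Twitter handle -> main domain
OUTLETS = {
    "eluniversocom": "eluniverso.com",
    "actualidadrt": "actualidad.rt.com",
}

def outlet_for(tweet):
    """Attribute a tweet to a news outlet: either it retweets the
    outlet's account directly, or it carries a URL whose host matches
    the outlet's domain (including subdomains)."""
    handle = (tweet.get("retweeted_handle") or "").lower()
    if handle in OUTLETS:
        return handle
    for url in tweet.get("urls", []):
        host = urlsplit(url).netloc.lower().removeprefix("www.")
        for name, domain in OUTLETS.items():
            if host == domain or host.endswith("." + domain):
                return name
    return None  # tweet not attributable to any listed outlet
```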

However, the news articles identified in our dataset cover topics ranging from the protests to sports. When studying the polarization of news consumption during a political event, it is important to first remove tweets that are irrelevant to the protests. Whether a tweet from a news agency is relevant is not obvious, but many tweets in our dataset contain the URL of the article they reference. For this reason, we determined the relevance to the protests of a small set comprising the 900 most tweeted URLs in our dataset, distributed among the different countries. We complemented this dataset with an additional set of URLs labeled by extracting subsection metadata from them: if the subsection referenced sports, culture, or technology, the URL was labeled as irrelevant to the protests. We then assigned each URL’s label to any tweet that used it. The final sample distribution is presented in Table 11. We note that even though we are able to assign a label to more than 100k tweets, most of them contain duplicated text (as news media tend to tweet the same thing multiple times). The classification was done on the deduplicated dataset.
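The subsection-based weak labeling and the deduplication step can be sketched as follows. The section keywords are illustrative assumptions: the text names sports, culture, and technology as irrelevant subsections but does not list the exact strings matched.

```python
# Assumed section keywords (English and Spanish variants); the paper
# names sports, culture and technology but not the exact strings.
IRRELEVANT_SECTIONS = {"sports", "deportes", "culture", "cultura",
                       "technology", "tecnologia"}

def label_from_section(section):
    """Weakly label a URL from its article-section metadata; sections
    outside the irrelevant set yield no label (None)."""
    if section and section.lower() in IRRELEVANT_SECTIONS:
        return "irrelevant"
    return None

def dedupe_texts(tweets):
    """Keep one tweet per distinct text, since outlets tweet the same
    headline many times; classification ran on deduplicated text."""
    seen, unique = set(), []
    for tweet in tweets:
        key = tweet["text"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique
```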

To classify the relevance of the tweets, we built a CNN text classifier [20] using 300-dimensional FastText embeddings trained on the combined datasets (both by stance and country) used to analyze the language polarization. We used 100 filters for each of the three filter sizes (3, 4, and 5) and a dropout rate of 50%. We achieved an accuracy and F1-score of 92% on a held-out test set. After predicting the labels of tweets (relevant or irrelevant to the protests), we obtain a dataset of 1,024,166 relevant and 675,496 irrelevant tweets. The distribution of the dataset is shown in Table 11. The analysis of polarization in news consumption patterns presented in this work was done only on the tweets relevant to the protests.
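The classifier follows Kim’s CNN [20]: parallel convolutions over the token-embedding sequence, max-over-time pooling, then dropout and a softmax. As a minimal illustration of the pooling operation only (not the paper’s implementation, which uses learned 300-dimensional filters, a nonlinearity, and training), here is max-over-time convolution in pure Python with toy dimensions:

```python
def conv_max_pool(embeddings, filt):
    """Slide one filter of width k over a token-embedding sequence
    and keep the maximum activation (max-over-time pooling)."""
    k, d = len(filt), len(filt[0])  # filter width, embedding dim
    best = float("-inf")
    for i in range(len(embeddings) - k + 1):
        activation = sum(filt[j][m] * embeddings[i + j][m]
                         for j in range(k) for m in range(d))
        best = max(best, activation)
    return best

def feature_vector(embeddings, filters):
    """Concatenate the pooled activations of all filters; with 100
    filters per width in {3, 4, 5} this yields the 300-dimensional
    feature vector fed to dropout (p = 0.5) and softmax."""
    return [conv_max_pool(embeddings, f) for f in filters]
```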


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Villa-Cox, R., Zeng, H.S., KhudaBukhsh, A.R., Carley, K.M. (2022). Linguistic and News-Sharing Polarization During the 2019 South American Protests. In: Hopfgartner, F., Jaidka, K., Mayr, P., Jose, J., Breitsohl, J. (eds) Social Informatics. SocInfo 2022. Lecture Notes in Computer Science, vol 13618. Springer, Cham.


  • DOI: https://doi.org/10.1007/978-3-031-19097-1_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19096-4

  • Online ISBN: 978-3-031-19097-1

  • eBook Packages: Computer Science, Computer Science (R0)