Abstract
COVID-19, which emerged in Wuhan, China, in December 2019, continues to spread rapidly across the world despite the availability of several vaccines. Although the coronavirus has been around for a significant period of time, people and authorities still feel the need for awareness because the virus mutates, and symptoms and prevention strategies vary accordingly. People and authorities rely heavily on social media platforms to share awareness information and voice their opinions, because these platforms can spread information to a massive audience almost instantly. People communicate over social media in many languages, chosen according to their familiarity, the language's reach, and its availability on the platform. The entire world has been hit by the coronavirus, and India is the second worst-hit country in terms of the number of active coronavirus cases. India, being a multilingual country, offers a great opportunity to study the reach of the various languages actively used across social media platforms. In this study, we examine a COVID-19-related dataset collected between February 2020 and July 2020, focusing specifically on regional languages in India. This could help the Government of India, state governments, NGOs, researchers, and policymakers study different issues related to the pandemic. We found that English was the mode of communication in over 64% of tweets, while as many as twelve regional languages in India account for approximately 4.77% of tweets.
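The language shares reported above are percentages of tweets per language. A minimal sketch of how such shares could be computed is given below; it assumes each tweet record carries a language code in a `lang` field, which the abstract does not specify. The function name `language_shares` and the sample records are hypothetical and serve only as illustration, not as the authors' pipeline.

```python
from collections import Counter


def language_shares(tweets):
    """Return the percentage of tweets per language code, rounded to 2 decimals.

    Assumption: each tweet is a dict with a `lang` key; records without one
    are counted under "und" (undetermined).
    """
    counts = Counter(t.get("lang", "und") for t in tweets)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: round(100 * n / total, 2) for lang, n in counts.items()}


# Toy records for illustration only; these are not drawn from the dataset.
sample = [{"lang": "en"}, {"lang": "en"}, {"lang": "hi"}, {"lang": "ta"}]
shares = language_shares(sample)
```

Summing the shares of the regional-language codes of interest would then give a combined figure comparable to the one reported for the twelve regional languages.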
D. Uniyal and A. Agarwal—Equal contribution.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Uniyal, D., Agarwal, A. (2021). IRLCov19: A Large COVID-19 Multilingual Twitter Dataset of Indian Regional Languages. In: Kamp, M., et al. Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2021. Communications in Computer and Information Science, vol 1525. Springer, Cham. https://doi.org/10.1007/978-3-030-93733-1_22
DOI: https://doi.org/10.1007/978-3-030-93733-1_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93732-4
Online ISBN: 978-3-030-93733-1
eBook Packages: Computer Science (R0)