Visualizing Regional Language Variation Across Europe on Twitter

Hovy, Dirk; Rahimi, Afshin; Baldwin, Timothy; Brooke, Julian

doi:10.1007/978-3-319-73400-2_175-1

Abstract

Geotagged Twitter data allows us to investigate correlations between geography and language variation, at both the interlingual and the intralingual level. Building on data-driven studies of such relationships, this paper investigates regional variation in language use on Twitter across Europe and compares the findings with traditional research on regional variation. It presents a novel method for processing large amounts of data and capturing gradual differences in language variation. Visualizing the results by deterministically translating linguistic features into color hues offers a novel view of language variation across Europe as reflected on Twitter. The technique is easy to apply to large amounts of data and provides a fast visual reference that can serve as input for further qualitative studies. Its general applicability is demonstrated in a number of studies both across and within national languages. The paper also discusses the unique challenges of large-scale analysis and visualization and the complementary nature of traditional qualitative and data-driven quantitative methods, and it argues for their possible synthesis.
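
The abstract does not spell out how linguistic features are translated into colors, so the following is only a minimal sketch of one possible deterministic mapping, not the chapter's actual procedure. It assumes each region is already described by a three-dimensional feature vector; the function name `features_to_rgb`, the region labels, and the toy values are all hypothetical. Each dimension is min-max scaled across regions and used as one RGB channel, so regions with similar features receive similar colors.

```python
from typing import Dict, List


def features_to_rgb(region_features: Dict[str, List[float]]) -> Dict[str, str]:
    """Map 3-dimensional feature vectors to hex RGB colours (illustrative only)."""
    names = sorted(region_features)
    # Per-dimension minimum and maximum across all regions.
    lows = [min(region_features[n][d] for n in names) for d in range(3)]
    highs = [max(region_features[n][d] for n in names) for d in range(3)]

    colours = {}
    for n in names:
        channels = []
        for d in range(3):
            span = highs[d] - lows[d]
            # A dimension that is constant across regions carries no contrast;
            # place it at the middle of the channel range.
            scaled = 0.5 if span == 0 else (region_features[n][d] - lows[d]) / span
            channels.append(round(scaled * 255))
        colours[n] = "#{:02x}{:02x}{:02x}".format(*channels)
    return colours


# Hypothetical toy input: three regions with made-up feature values.
toy = {
    "region_a": [0.0, 1.0, 0.5],
    "region_b": [1.0, 0.0, 0.5],
    "region_c": [0.5, 0.5, 0.5],
}
colours = features_to_rgb(toy)
for name, colour in colours.items():
    print(name, colour)
```

Because the mapping has no random component, the same input always yields the same colors, which is what makes such a map usable as a stable visual reference. Gradual differences in the features show up as gradual shifts in color rather than hard region boundaries.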

Author information

Authors and Affiliations

Department of Marketing, Bocconi University, Milan, Italy
Dirk Hovy
Computer Science Department, The University of Melbourne, Melbourne, VIC, Australia
Afshin Rahimi
School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
Timothy Baldwin
Department of Linguistics, University of British Columbia, Kelowna, BC, Canada
Julian Brooke

Corresponding author

Correspondence to Dirk Hovy.

Editor information

Editors and Affiliations

Department of Geography, University of Kentucky, Lexington, KY, USA
Stanley D Brunn
Deutscher Sprachatlas, Marburg University, Marburg, Hessen, Germany
Roland Kehrein

About this entry

Cite this entry

Hovy, D., Rahimi, A., Baldwin, T., Brooke, J. (2019). Visualizing Regional Language Variation Across Europe on Twitter. In: Brunn, S., Kehrein, R. (eds) Handbook of the Changing World Language Map. Springer, Cham. https://doi.org/10.1007/978-3-319-73400-2_175-1

DOI: https://doi.org/10.1007/978-3-319-73400-2_175-1
Received: 30 August 2018
Accepted: 30 August 2018
Published: 22 March 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73400-2
Online ISBN: 978-3-319-73400-2
eBook Packages: Springer Reference Earth and Environm. Science; Reference Module Physical and Materials Science; Reference Module Earth and Environmental Sciences