Skip to main content

Visualizing Regional Language Variation Across Europe on Twitter

  • Living reference work entry
  • First Online:
Handbook of the Changing World Language Map

Abstract

Geotagged Twitter data allows us to investigate correlations of geographic language variation, both at an interlingual and intralingual level. Based on data-driven studies of such relationships, this paper investigates regional variation of language usage on Twitter across Europe and compares it to traditional research of regional variation. This paper presents a novel method to process large amounts of data and to capture gradual differences in language variation. Visualizing the results by deterministically translating linguistic features into color hues presents a novel view of language variation across Europe, as it is reflected on Twitter. The technique is easy to apply to large amounts of data and provides a fast visual reference that can serve as input for further qualitative studies. The general applicability is demonstrated on a number of studies both across and within national languages. This paper also discusses the unique challenges of large-scale analysis and visualization, and the complementary nature of traditional qualitative and data-driven quantitative methods, and argues for their possible synthesis.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Institutional subscriptions

References

  • Alowibdi, J. S., Buy, U. A., & Yu, P. (2013). Empirical evaluation of profile characteristics for gender classification on Twitter. In 2013 12th International Conference on Machine Learning and Applications (ICMLA) (Vol. 1, pp. 365–369).

    Chapter  Google Scholar 

  • Auer, P., & Schmidt, J. E. (2010). Language and space: An international handbook of linguistic variation (Volume 1: Theories and methods). Berlin: DeGruyter.

    Google Scholar 

  • Bamman, D., Dyer, C., & Smith, N. A. (2014). Distributed representations of geographically situated language. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 828–834).

    Google Scholar 

  • Benton, A., Mitchell, M., & Hovy, D. (2017). Multitask learning for mental health conditions with limited social media data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (Vol. 1, pp. 152–162).

    Google Scholar 

  • Bergsma, S., Dredze, M., Van Durme, B., Wilson, T., & Yarowsky, D. (2013). Broadly improving user classification via communication-based name and location clustering on twitter. In Proceedings of North American Chapter of the Association for Computational Linguistics (pp. 1010–1019).

    Google Scholar 

  • Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. Sebastopol, CA: O’Reilly Media, Inc.

    Google Scholar 

  • Chambers, J., & Trudgill, P. (1998). Dialectology. Cambridge, UK: Cambridge University Press.

    Google Scholar 

  • Cheshire, J. (2005). Syntactic variation and beyond: Gender and social class variation in the use of discourse-new markers1. Journal of SocioLinguistics, 9(4), 479–508.

    Article  Google Scholar 

  • Ciot, M., Sonderegger, M., & Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle (pp. 18–21).

    Google Scholar 

  • DOMO. (2017). Data never sleeps 5.0: How much data is generated every minute? Available at https://web-assets.domo.com/blog/wp-content/uploads/2017/07/17_domo_data-never-sleeps-5-01.png. Last accessed August 13, 2018.

  • Doyle, G. (2014). Mapping dialectal variation by querying social media. In Proceedings of European Chapter of the Association for Computational Linguistics (pp. 98–106).

    Google Scholar 

  • Eisenstein, J. (2013). What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 359–369).

    Google Scholar 

  • Eisenstein, J. (2015). Systematic patterning in phonologically-motivated orthographic variation. Journal of SocioLinguistics, 19(2), 161–188.

    Article  Google Scholar 

  • Eisenstein, J., Smith, N. A., & Xing, E. P. (2011). Discovering sociolinguistic associations with structured sparsity. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Vol. 1, pp. 1365–1374).

    Google Scholar 

  • Eisenstein, J., O’Connor, B., Smith, N., & Xing, E. P. (2014). Diffusion of lexical change in social media. PLoS One, 9.

    Google Scholar 

  • Gilliéron, J., & Edmont, E. (1902–1910). Atlas linguistique de la France. Paris: Champion.

    Google Scholar 

  • Goebl, H. (1982). Dialektometrie: Prinzipien und Methoden des Einsatzes der numerischen Taxonomie im Bereich der Dialektgeographie. Vienna: Österreichische Akademie der Wissenschaften.

    Google Scholar 

  • Goldberg, Y. (2016). A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57, 345–420.

    Article  Google Scholar 

  • Grieve, J., Speelman, D., & Geeraerts, D. (2011). A statistical method for the identification and aggregation of regional linguistic variation. Language Variation and Change, 23, 193–221.

    Article  Google Scholar 

  • Heeringa, W. (2004). Measuring dialect pronunciation differences using Levenshtein distance (Unpublished doctoral dissertation). University of Groningen.

    Google Scholar 

  • Hovy, D. (2015). Demographic factors improve classification performance. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Vol. 1, pp. 752–762).

    Google Scholar 

  • Hovy, D., & Purschke, C. (2018). Capturing regional variation with distributed place representations and geographic retrofitting. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

    Google Scholar 

  • Hovy, D., Johannsen, A., & Søgaard, A. (2015). User review sites as a resource for large-scale sociolinguistic studies. In Proceedings of the 24th International Conference on World Wide Web (pp. 452–461). International World Wide Web Conferences Steering Committee.

    Google Scholar 

  • Huang, Y., Guo, D., Kasakoff, A., & Grieve, J. (2016). Understanding U.S. regional linguistic variation with Twitter data analysis. Computers, Environment and Urban Systems, 59, 244–255.

    Article  Google Scholar 

  • Jaberg, K., Jud, J., & Scheuermeier, P. (1928–1940). Sprach- und Sachatlas Italiens und der Südschweiz. Zofingen: Ringier.

    Google Scholar 

  • Johannsen, A., Hovy, D., & Søgaard, A. (2015). Cross-lingual syntactic variation over age and gender. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning (pp. 103–112).

    Google Scholar 

  • Jurafsky, D., & Martin, J. H. (2014). Speech and language processing (3rd ed.). London: Pearson.

    Google Scholar 

  • Kitchin, R. (2014). Big data, new epistemologies and paradigm shifts. Big Data and Society, 1(1), 2053951714528481.

    Article  Google Scholar 

  • Kehrein, R. (2017). Languages. In Brunn, Stanley D./Dodge, Martin (eds.): Mapping Across Academia (pp. 183-208). Springer, Dordrecht.

    Chapter  Google Scholar 

  • Labov, W., Ash, S., & Boberg, C. (2005). The atlas of North American English: Phonetics, phonology and sound change. Berlin: Walter de Gruyter.

    Google Scholar 

  • Lameli, A. (2013). Strukturen im Sprachraum: Analysen zur arealtypologischen Komplexität der Dialekte in Deutschland. In Linguistik – Impulse und Tendenzen 54. Berlin/Boston: Walter de Gruyter.

    Google Scholar 

  • Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) (pp. 1188–1196).

    Google Scholar 

  • Liu, W., & Ruths, D. (2013). What’s in a name? Using first names as features for gender inference in Twitter. In AAAI Spring Symposium: Analyzing Microtext (Vol. 13, no. 1, pp. 10–16).

    Google Scholar 

  • Longley, P. A., Adnan, M., & Lansley, G. (2015). The geotemporal demographics of twitter usage. Environment and Planning A, 47(2), 465–484.

    Article  Google Scholar 

  • Lui, M., & Baldwin, T. (2012). Langid. Py: An off-the-shelf language identification tool. In Proceedings of the Conference of the Association for Computational Linguistics, System Demonstrations (pp. 25–30). Association for Computational Linguistics.

    Google Scholar 

  • Manning, C. D. & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

    Google Scholar 

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111–3119).

    Google Scholar 

  • Nerbonne, J., Heeringa, W., & Kleiweg, P. (1999). Edit distance and dialect proximity. In D. Sankoff & J. Kruska (Eds.), Time warps, string edits and macromolecules: The theory and practice of sequence comparison (pp. v–xv). Stanford, CA: CSLI.

    Google Scholar 

  • Nguyen, D., Smith, N. A., & Rosé, C. P. (2011). Author age prediction from text using linear regression. In Proceedings of the 5th Association for Computational Linguistics: Human Language Technologies, Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (pp. 115–123).

    Google Scholar 

  • Nguyen, D., Doğruöz, A. S., Rosé, C. P., & de Jong, F. (2016). Computational sociolinguistics: A survey. Computational Linguistics, 42, 537–593.

    Article  Google Scholar 

  • Orton, H., & Dieth, E. (1962–1971). Survey of English dialects: Basic materials. Leeds: E.J. Arnold and Son.

    Google Scholar 

  • Pröll, S. (2013). Detecting structures in linguistic maps – Fuzzy clustering for pattern recognition in geostatistical dialectometry. Literary and Linguistic Computing, 28, 108–118.

    Article  Google Scholar 

  • Purschke, C., & Hovy, D. (2019). Lörres, Möppes, and the Swiss. (Re)discovering regional patterns in anonymous social media data. Journal of Linguistic Geography, 1.

    Google Scholar 

  • Rabanus, S. (2019). Language mapping worldwide: Methods and traditions. In: Stanley D. Brunn and Roland Kehrein (Eds.) Handbook of the Changing World Language Map. Dordrecht: Springer.

    Google Scholar 

  • Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (pp. 45–50). Valletta: ELRA. http://is.muni.cz/publication/884893/en.

    Google Scholar 

  • Rosenthal, S., & McKeown, K. (2011). Age prediction in blogs: A study of style, content, and online behavior in pre-and post-social media generations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Vol. 1, pp. 763–772).

    Google Scholar 

  • Schmidt, J. E., & Herrgen, J. (2001). Digitaler Wenker-Atlas (DiWA). Bearbeitet von Alfred Lameli, Tanja Giessler, Roland Kehrein, Alexandra Lenz, Karl-Heinz Müller, Jost Nickel, Christoph Purschke und Stefan Rabanus. Erste vollständige Ausgabe von Georg Wenkers “Sprachatlas des Deutschen Reichs”. Retrieved June 2, 2014, from http://diwa.info/titel.aspx

  • Shackleton, R. G., Jr. (2005). English-American speech relationships: A quantitative approach. Journal of English Linguistics, 33(2), 99–160.

    Article  Google Scholar 

  • Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

    Article  Google Scholar 

  • Volkova, S.,Wilson, T., & Yarowsky, D. (2013). Exploring demographic language variations to improve multilingual sentiment analysis in social media. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1815–1827). Association for Computational Linguistics.

    Google Scholar 

  • Volkova, S., Bachrach, Y., Armstrong, M., & Sharma, V. (2015). Inferring latent user properties from texts published in social media (demo). In Proceedings of the 29th Conference on Artificial Intelligence (AAAI) (pp. 4296–4297).

    Google Scholar 

  • Wenker, G., & Wrede, F. (1881). Der Sprachatlas des deutschen Reichs. Elwert.

    Google Scholar 

  • Wieling, M., & Nerbonne, J. (2015). Advances in dialectometry. The Annual Review of Linguistics, 1, 243–264.

    Article  Google Scholar 

  • Wieling, M., Nerbonne, J., & Baayen, H. R. (2011). Quantitative social dialectology: Explaining linguistic variation geographically and socially. PLoS One, 6(9), e23613.

    Article  Google Scholar 

  • Wiesinger, P. (1983). Die Einteilung der deutschen Dialekte. In Werner Besch, Ulrich Knoop, Wolfgang Putschke & Herbert Ernst Wiegand (eds.), Dialektologie: ein Handbuch zur deutschen und allgemeinen Dialektforschung Vol. 2, 807–900. (Handbooks of Linguistics and Communication Science. 1.2). Berlin: de Gruyter.

    Google Scholar 

Download references

Publisher’s note:

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dirk Hovy .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this entry

Check for updates. Verify currency and authenticity via CrossMark

Cite this entry

Hovy, D., Rahimi, A., Baldwin, T., Brooke, J. (2019). Visualizing Regional Language Variation Across Europe on Twitter. In: Brunn, S., Kehrein, R. (eds) Handbook of the Changing World Language Map. Springer, Cham. https://doi.org/10.1007/978-3-319-73400-2_175-1

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-73400-2_175-1

  • Received:

  • Accepted:

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73400-2

  • Online ISBN: 978-3-319-73400-2

  • eBook Packages: Springer Reference Earth and Environm. ScienceReference Module Physical and Materials ScienceReference Module Earth and Environmental Sciences

Publish with us

Policies and ethics