Advertisement

Visualizing Regional Language Variation Across Europe on Twitter

  • Dirk HovyEmail author
  • Afshin Rahimi
  • Timothy Baldwin
  • Julian Brooke
Reference work entry

Abstract

Geotagged Twitter data allows us to investigate correlations of geographic language variation, both at an interlingual and intralingual level. Based on data-driven studies of such relationships, this paper investigates regional variation of language usage on Twitter across Europe and compares it to traditional research of regional variation. This paper presents a novel method to process large amounts of data and to capture gradual differences in language variation. Visualizing the results by deterministically translating linguistic features into color hues presents a novel view of language variation across Europe, as it is reflected on Twitter. The technique is easy to apply to large amounts of data and provides a fast visual reference that can serve as input for further qualitative studies. The general applicability is demonstrated on a number of studies both across and within national languages. This paper also discusses the unique challenges of large-scale analysis and visualization, and the complementary nature of traditional qualitative and data-driven quantitative methods, and argues for their possible synthesis.

Keywords

Twitter Language variation Computational methods Computational sociolinguistics Representation learning 

Notes

Publisher’s note:

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Alowibdi, J. S., Buy, U. A., & Yu, P. (2013). Empirical evaluation of profile characteristics for gender classification on Twitter. In 2013 12th International Conference on Machine Learning and Applications (ICMLA) (Vol. 1, pp. 365–369).CrossRefGoogle Scholar
  2. Auer, P., & Schmidt, J.E. (Eds.) (2010). Language and space: An international handbook of linguistic variation (Volume 1: Theories and methods). Berlin: DeGruyter.Google Scholar
  3. Bamman, D., Dyer, C., & Smith, N. A. (2014). Distributed representations of geographically situated language. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 828–834).Google Scholar
  4. Benton, A., Mitchell, M., & Hovy, D. (2017). Multitask learning for mental health conditions with limited social media data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (Vol. 1, pp. 152–162).Google Scholar
  5. Bergsma, S., Dredze, M., Van Durme, B., Wilson, T., & Yarowsky, D. (2013). Broadly improving user classification via communication-based name and location clustering on twitter. In Proceedings of North American Chapter of the Association for Computational Linguistics (pp. 1010–1019).Google Scholar
  6. Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. Sebastopol, CA: O’Reilly Media, Inc.Google Scholar
  7. Chambers, J., & Trudgill, P. (1998). Dialectology. Cambridge, UK: Cambridge University Press.Google Scholar
  8. Cheshire, J. (2005). Syntactic variation and beyond: Gender and social class variation in the use of discourse-new markers1. Journal of SocioLinguistics, 9(4), 479–508.CrossRefGoogle Scholar
  9. Ciot, M., Sonderegger, M., & Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle (pp. 18–21).Google Scholar
  10. DOMO. (2017). Data never sleeps 5.0: How much data is generated every minute? Available at https://web-assets.domo.com/blog/wp-content/uploads/2017/07/17_domo_data-never-sleeps-5-01.png. Last accessed August 13, 2018.
  11. Doyle, G. (2014). Mapping dialectal variation by querying social media. In Proceedings of European Chapter of the Association for Computational Linguistics (pp. 98–106).Google Scholar
  12. Eisenstein, J. (2013). What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 359–369).Google Scholar
  13. Eisenstein, J. (2015). Systematic patterning in phonologically-motivated orthographic variation. Journal of SocioLinguistics, 19(2), 161–188.CrossRefGoogle Scholar
  14. Eisenstein, J., Smith, N. A., & Xing, E. P. (2011). Discovering sociolinguistic associations with structured sparsity. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Vol. 1, pp. 1365–1374).Google Scholar
  15. Eisenstein, J., O’Connor, B., Smith, N., & Xing, E. P. (2014). Diffusion of lexical change in social media. PLoS One, 9.Google Scholar
  16. Gilliéron, J., & Edmont, E. (1902–1910). Atlas linguistique de la France. Paris: Champion.Google Scholar
  17. Goebl, H. (1982). Dialektometrie: Prinzipien und Methoden des Einsatzes der numerischen Taxonomie im Bereich der Dialektgeographie. Vienna: Österreichische Akademie der Wissenschaften.Google Scholar
  18. Goldberg, Y. (2016). A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57, 345–420.CrossRefGoogle Scholar
  19. Grieve, J., Speelman, D., & Geeraerts, D. (2011). A statistical method for the identification and aggregation of regional linguistic variation. Language Variation and Change, 23, 193–221.CrossRefGoogle Scholar
  20. Heeringa, W. (2004). Measuring dialect pronunciation differences using Levenshtein distance (Unpublished doctoral dissertation). University of Groningen.Google Scholar
  21. Hovy, D. (2015). Demographic factors improve classification performance. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Vol. 1, pp. 752–762).Google Scholar
  22. Hovy, D., & Purschke, C. (2018). Capturing regional variation with distributed place representations and geographic retrofitting. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.Google Scholar
  23. Hovy, D., Johannsen, A., & Søgaard, A. (2015). User review sites as a resource for large-scale sociolinguistic studies. In Proceedings of the 24th International Conference on World Wide Web (pp. 452–461). International World Wide Web Conferences Steering Committee.Google Scholar
  24. Huang, Y., Guo, D., Kasakoff, A., & Grieve, J. (2016). Understanding U.S. regional linguistic variation with Twitter data analysis. Computers, Environment and Urban Systems, 59, 244–255.CrossRefGoogle Scholar
  25. Jaberg, K., Jud, J., & Scheuermeier, P. (1928–1940). Sprach- und Sachatlas Italiens und der Südschweiz. Zofingen: Ringier.Google Scholar
  26. Johannsen, A., Hovy, D., & Søgaard, A. (2015). Cross-lingual syntactic variation over age and gender. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning (pp. 103–112).Google Scholar
  27. Jurafsky, D., & Martin, J. H. (2014). Speech and language processing (3rd ed.). London: Pearson.Google Scholar
  28. Kehrein, R. (2017). Languages. In S. D. Brunn, & M. Dodge (Eds.), Mapping Across Academia (pp. 183-208). Springer, Dordrecht.Google Scholar
  29. Kitchin, R. (2014). Big data, new epistemologies and paradigm shifts. Big Data and Society, 1(1), 2053951714528481.CrossRefGoogle Scholar
  30. Labov, W., Ash, S., & Boberg, C. (2005). The atlas of North American English: Phonetics, phonology and sound change. Berlin: Walter de Gruyter.Google Scholar
  31. Lameli, A. (2013). Strukturen im Sprachraum: Analysen zur arealtypologischen Komplexität der Dialekte in Deutschland. In Linguistik – Impulse und Tendenzen 54. Berlin/Boston: Walter de Gruyter.Google Scholar
  32. Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) (pp. 1188–1196).Google Scholar
  33. Liu, W., & Ruths, D. (2013). What’s in a name? Using first names as features for gender inference in Twitter. In AAAI Spring Symposium: Analyzing Microtext (Vol. 13, no. 1, pp. 10–16).Google Scholar
  34. Longley, P. A., Adnan, M., & Lansley, G. (2015). The geotemporal demographics of twitter usage. Environment and Planning A, 47(2), 465–484.CrossRefGoogle Scholar
  35. Lui, M., & Baldwin, T. (2012). Langid. Py: An off-the-shelf language identification tool. In Proceedings of the Conference of the Association for Computational Linguistics, System Demonstrations (pp. 25–30). Association for Computational Linguistics.Google Scholar
  36. Manning, C. D. & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.Google Scholar
  37. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111–3119).Google Scholar
  38. Nerbonne, J., Heeringa, W., & Kleiweg, P. (1999). Edit distance and dialect proximity. In D. Sankoff & J. Kruska (Eds.), Time warps, string edits and macromolecules: The theory and practice of sequence comparison (pp. v–xv). Stanford, CA: CSLI.Google Scholar
  39. Nguyen, D., Smith, N. A., & Rosé, C. P. (2011). Author age prediction from text using linear regression. In Proceedings of the 5th Association for Computational Linguistics: Human Language Technologies, Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (pp. 115–123).Google Scholar
  40. Nguyen, D., Doğruöz, A. S., Rosé, C. P., & de Jong, F. (2016). Computational sociolinguistics: A survey. Computational Linguistics, 42, 537–593.CrossRefGoogle Scholar
  41. Orton, H., & Dieth, E. (1962–1971). Survey of English dialects: Basic materials. Leeds: E.J. Arnold and Son.Google Scholar
  42. Pröll, S. (2013). Detecting structures in linguistic maps – Fuzzy clustering for pattern recognition in geostatistical dialectometry. Literary and Linguistic Computing, 28, 108–118.CrossRefGoogle Scholar
  43. Purschke, C., & Hovy, D. (2019). Lörres, Möppes, and the Swiss. (Re)discovering regional patterns in anonymous social media data. Journal of Linguistic Geography, 1.Google Scholar
  44. Rabanus, S. (2019). Language Mapping Worldwide: Methods and Traditions. S. D. Brunn, & R. Kehrein (Eds.) Handbook of the Changing World Language Map. Dordrecht: Springer.Google Scholar
  45. Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (pp. 45–50). Valletta: ELRA. http://is.muni.cz/publication/884893/en.Google Scholar
  46. Rosenthal, S., & McKeown, K. (2011). Age prediction in blogs: A study of style, content, and online behavior in pre-and post-social media generations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Vol. 1, pp. 763–772).Google Scholar
  47. Schmidt, J. E., & Herrgen, J. (2001). Digitaler Wenker-Atlas (DiWA). Bearbeitet von Alfred Lameli, Tanja Giessler, Roland Kehrein, Alexandra Lenz, Karl-Heinz Müller, Jost Nickel, Christoph Purschke und Stefan Rabanus. Erste vollständige Ausgabe von Georg Wenkers “Sprachatlas des Deutschen Reichs”. Retrieved June 2, 2014, from http://diwa.info/titel.aspx
  48. Shackleton, R. G., Jr. (2005). English-American speech relationships: A quantitative approach. Journal of English Linguistics, 33(2), 99–160.CrossRefGoogle Scholar
  49. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.CrossRefGoogle Scholar
  50. Volkova, S.,Wilson, T., & Yarowsky, D. (2013). Exploring demographic language variations to improve multilingual sentiment analysis in social media. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1815–1827). Association for Computational Linguistics.Google Scholar
  51. Volkova, S., Bachrach, Y., Armstrong, M., & Sharma, V. (2015). Inferring latent user properties from texts published in social media (demo). In Proceedings of the 29th Conference on Artificial Intelligence (AAAI) (pp. 4296–4297).Google Scholar
  52. Wenker, G., & Wrede, F. (1881). Der Sprachatlas des deutschen Reichs. Elwert.Google Scholar
  53. Wieling, M., & Nerbonne, J. (2015). Advances in dialectometry. The Annual Review of Linguistics, 1, 243–264.CrossRefGoogle Scholar
  54. Wieling, M., Nerbonne, J., & Baayen, H. R. (2011). Quantitative social dialectology: Explaining linguistic variation geographically and socially. PLoS One, 6(9), e23613.CrossRefGoogle Scholar
  55. Wiesinger, P. (1983). Die Einteilung der deutschen Dialekte. In Werner Besch, Ulrich Knoop, Wolfgang Putschke & Herbert Ernst Wiegand (Eds.), Dialektologie: ein Handbuch zur deutschen und allgemeinen Dialektforschung Vol. 2, 807–900. (Handbooks of Linguistics and Communication Science. 1.2). Berlin: de Gruyter.Google Scholar

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Dirk Hovy
    • 1
    Email author
  • Afshin Rahimi
    • 2
  • Timothy Baldwin
    • 3
  • Julian Brooke
    • 4
  1. 1.Department of MarketingBocconi UniversityMilanItaly
  2. 2.Computer Science DepartmentThe University of MelbourneMelbourneAustralia
  3. 3.School of Computing and Information SystemsThe University of MelbourneMelbourneAustralia
  4. 4.Department of LinguisticsUniversity of British ColumbiaKelownaCanada

Personalised recommendations