Abstract
Geotagged Twitter data allows us to investigate correlations of geographic language variation, both at an interlingual and intralingual level. Based on data-driven studies of such relationships, this paper investigates regional variation of language usage on Twitter across Europe and compares it to traditional research of regional variation. This paper presents a novel method to process large amounts of data and to capture gradual differences in language variation. Visualizing the results by deterministically translating linguistic features into color hues presents a novel view of language variation across Europe, as it is reflected on Twitter. The technique is easy to apply to large amounts of data and provides a fast visual reference that can serve as input for further qualitative studies. The general applicability is demonstrated on a number of studies both across and within national languages. This paper also discusses the unique challenges of large-scale analysis and visualization, and the complementary nature of traditional qualitative and data-driven quantitative methods, and argues for their possible synthesis.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
References
Alowibdi, J. S., Buy, U. A., & Yu, P. (2013). Empirical evaluation of profile characteristics for gender classification on Twitter. In 2013 12th International Conference on Machine Learning and Applications (ICMLA) (Vol. 1, pp. 365–369).
Auer, P., & Schmidt, J. E. (2010). Language and space: An international handbook of linguistic variation (Volume 1: Theories and methods). Berlin: DeGruyter.
Bamman, D., Dyer, C., & Smith, N. A. (2014). Distributed representations of geographically situated language. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 828–834).
Benton, A., Mitchell, M., & Hovy, D. (2017). Multitask learning for mental health conditions with limited social media data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (Vol. 1, pp. 152–162).
Bergsma, S., Dredze, M., Van Durme, B., Wilson, T., & Yarowsky, D. (2013). Broadly improving user classification via communication-based name and location clustering on twitter. In Proceedings of North American Chapter of the Association for Computational Linguistics (pp. 1010–1019).
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. Sebastopol, CA: O’Reilly Media, Inc.
Chambers, J., & Trudgill, P. (1998). Dialectology. Cambridge, UK: Cambridge University Press.
Cheshire, J. (2005). Syntactic variation and beyond: Gender and social class variation in the use of discourse-new markers1. Journal of SocioLinguistics, 9(4), 479–508.
Ciot, M., Sonderegger, M., & Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle (pp. 18–21).
DOMO. (2017). Data never sleeps 5.0: How much data is generated every minute? Available at https://web-assets.domo.com/blog/wp-content/uploads/2017/07/17_domo_data-never-sleeps-5-01.png. Last accessed August 13, 2018.
Doyle, G. (2014). Mapping dialectal variation by querying social media. In Proceedings of European Chapter of the Association for Computational Linguistics (pp. 98–106).
Eisenstein, J. (2013). What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 359–369).
Eisenstein, J. (2015). Systematic patterning in phonologically-motivated orthographic variation. Journal of SocioLinguistics, 19(2), 161–188.
Eisenstein, J., Smith, N. A., & Xing, E. P. (2011). Discovering sociolinguistic associations with structured sparsity. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Vol. 1, pp. 1365–1374).
Eisenstein, J., O’Connor, B., Smith, N., & Xing, E. P. (2014). Diffusion of lexical change in social media. PLoS One, 9.
Gilliéron, J., & Edmont, E. (1902–1910). Atlas linguistique de la France. Paris: Champion.
Goebl, H. (1982). Dialektometrie: Prinzipien und Methoden des Einsatzes der numerischen Taxonomie im Bereich der Dialektgeographie. Vienna: Österreichische Akademie der Wissenschaften.
Goldberg, Y. (2016). A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57, 345–420.
Grieve, J., Speelman, D., & Geeraerts, D. (2011). A statistical method for the identification and aggregation of regional linguistic variation. Language Variation and Change, 23, 193–221.
Heeringa, W. (2004). Measuring dialect pronunciation differences using Levenshtein distance (Unpublished doctoral dissertation). University of Groningen.
Hovy, D. (2015). Demographic factors improve classification performance. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Vol. 1, pp. 752–762).
Hovy, D., & Purschke, C. (2018). Capturing regional variation with distributed place representations and geographic retrofitting. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Hovy, D., Johannsen, A., & Søgaard, A. (2015). User review sites as a resource for large-scale sociolinguistic studies. In Proceedings of the 24th International Conference on World Wide Web (pp. 452–461). International World Wide Web Conferences Steering Committee.
Huang, Y., Guo, D., Kasakoff, A., & Grieve, J. (2016). Understanding U.S. regional linguistic variation with Twitter data analysis. Computers, Environment and Urban Systems, 59, 244–255.
Jaberg, K., Jud, J., & Scheuermeier, P. (1928–1940). Sprach- und Sachatlas Italiens und der Südschweiz. Zofingen: Ringier.
Johannsen, A., Hovy, D., & Søgaard, A. (2015). Cross-lingual syntactic variation over age and gender. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning (pp. 103–112).
Jurafsky, D., & Martin, J. H. (2014). Speech and language processing (3rd ed.). London: Pearson.
Kitchin, R. (2014). Big data, new epistemologies and paradigm shifts. Big Data and Society, 1(1), 2053951714528481.
Kehrein, R. (2017). Languages. In Brunn, Stanley D./Dodge, Martin (eds.): Mapping Across Academia (pp. 183-208). Springer, Dordrecht.
Labov, W., Ash, S., & Boberg, C. (2005). The atlas of North American English: Phonetics, phonology and sound change. Berlin: Walter de Gruyter.
Lameli, A. (2013). Strukturen im Sprachraum: Analysen zur arealtypologischen Komplexität der Dialekte in Deutschland. In Linguistik – Impulse und Tendenzen 54. Berlin/Boston: Walter de Gruyter.
Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) (pp. 1188–1196).
Liu, W., & Ruths, D. (2013). What’s in a name? Using first names as features for gender inference in Twitter. In AAAI Spring Symposium: Analyzing Microtext (Vol. 13, no. 1, pp. 10–16).
Longley, P. A., Adnan, M., & Lansley, G. (2015). The geotemporal demographics of twitter usage. Environment and Planning A, 47(2), 465–484.
Lui, M., & Baldwin, T. (2012). Langid. Py: An off-the-shelf language identification tool. In Proceedings of the Conference of the Association for Computational Linguistics, System Demonstrations (pp. 25–30). Association for Computational Linguistics.
Manning, C. D. & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111–3119).
Nerbonne, J., Heeringa, W., & Kleiweg, P. (1999). Edit distance and dialect proximity. In D. Sankoff & J. Kruska (Eds.), Time warps, string edits and macromolecules: The theory and practice of sequence comparison (pp. v–xv). Stanford, CA: CSLI.
Nguyen, D., Smith, N. A., & Rosé, C. P. (2011). Author age prediction from text using linear regression. In Proceedings of the 5th Association for Computational Linguistics: Human Language Technologies, Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (pp. 115–123).
Nguyen, D., Doğruöz, A. S., Rosé, C. P., & de Jong, F. (2016). Computational sociolinguistics: A survey. Computational Linguistics, 42, 537–593.
Orton, H., & Dieth, E. (1962–1971). Survey of English dialects: Basic materials. Leeds: E.J. Arnold and Son.
Pröll, S. (2013). Detecting structures in linguistic maps – Fuzzy clustering for pattern recognition in geostatistical dialectometry. Literary and Linguistic Computing, 28, 108–118.
Purschke, C., & Hovy, D. (2019). Lörres, Möppes, and the Swiss. (Re)discovering regional patterns in anonymous social media data. Journal of Linguistic Geography, 1.
Rabanus, S. (2019). Language mapping worldwide: Methods and traditions. In: Stanley D. Brunn and Roland Kehrein (Eds.) Handbook of the Changing World Language Map. Dordrecht: Springer.
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (pp. 45–50). Valletta: ELRA. http://is.muni.cz/publication/884893/en.
Rosenthal, S., & McKeown, K. (2011). Age prediction in blogs: A study of style, content, and online behavior in pre-and post-social media generations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Vol. 1, pp. 763–772).
Schmidt, J. E., & Herrgen, J. (2001). Digitaler Wenker-Atlas (DiWA). Bearbeitet von Alfred Lameli, Tanja Giessler, Roland Kehrein, Alexandra Lenz, Karl-Heinz Müller, Jost Nickel, Christoph Purschke und Stefan Rabanus. Erste vollständige Ausgabe von Georg Wenkers “Sprachatlas des Deutschen Reichs”. Retrieved June 2, 2014, from http://diwa.info/titel.aspx
Shackleton, R. G., Jr. (2005). English-American speech relationships: A quantitative approach. Journal of English Linguistics, 33(2), 99–160.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Volkova, S.,Wilson, T., & Yarowsky, D. (2013). Exploring demographic language variations to improve multilingual sentiment analysis in social media. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1815–1827). Association for Computational Linguistics.
Volkova, S., Bachrach, Y., Armstrong, M., & Sharma, V. (2015). Inferring latent user properties from texts published in social media (demo). In Proceedings of the 29th Conference on Artificial Intelligence (AAAI) (pp. 4296–4297).
Wenker, G., & Wrede, F. (1881). Der Sprachatlas des deutschen Reichs. Elwert.
Wieling, M., & Nerbonne, J. (2015). Advances in dialectometry. The Annual Review of Linguistics, 1, 243–264.
Wieling, M., Nerbonne, J., & Baayen, H. R. (2011). Quantitative social dialectology: Explaining linguistic variation geographically and socially. PLoS One, 6(9), e23613.
Wiesinger, P. (1983). Die Einteilung der deutschen Dialekte. In Werner Besch, Ulrich Knoop, Wolfgang Putschke & Herbert Ernst Wiegand (eds.), Dialektologie: ein Handbuch zur deutschen und allgemeinen Dialektforschung Vol. 2, 807–900. (Handbooks of Linguistics and Communication Science. 1.2). Berlin: de Gruyter.
Publisher’s note:
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this entry
Cite this entry
Hovy, D., Rahimi, A., Baldwin, T., Brooke, J. (2019). Visualizing Regional Language Variation Across Europe on Twitter. In: Brunn, S., Kehrein, R. (eds) Handbook of the Changing World Language Map. Springer, Cham. https://doi.org/10.1007/978-3-319-73400-2_175-1
Download citation
DOI: https://doi.org/10.1007/978-3-319-73400-2_175-1
Received:
Accepted:
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73400-2
Online ISBN: 978-3-319-73400-2
eBook Packages: Springer Reference Earth and Environm. ScienceReference Module Physical and Materials ScienceReference Module Earth and Environmental Sciences