Specific Challenges of Variation and Text Types


Part of the book series: Synthesis Lectures on Human Language Technologies (SLHLT)

Abstract

One fascinating aspect of language identification, and one that makes it difficult, is the similarity between languages. Some languages are extremely easy to distinguish from each other, whereas others are extremely difficult to tell apart. This phenomenon is closely tied to the definition of “language”, which is much less trivial than one might think. It is hard to draw the line between languages and dialects. For example, mutual intelligibility is an often-mentioned criterion, but it is highly subjective and very difficult to measure objectively. Several organizations have defined lists of languages. Ethnologue: Languages of the World is currently in its 25th edition and lists 7,168 known living languages. It is published by SIL International, which is also responsible for the ISO 639-3 standard consisting of three-letter codes representing individual languages. The Library of Congress is the registration authority for the ISO 639-2 standard, which consists of ISO 639-3-compatible three-letter codes for a considerably smaller number of languages and is likewise continuously updated. Glottolog, published by the Max Planck Institute, lists 8,572 entries in its version 4.7. Volume two of the Linguasphere Register includes over 30,000 languages and dialects. Of these lists, ISO 639-3 and its subset ISO 639-2 are the most widely used, even though the two-letter codes from ISO 639-1 are still in use on many occasions.
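Because two-letter ISO 639-1 codes and three-letter ISO 639-2 and ISO 639-3 codes coexist in practice, language labels often need to be normalized to a single scheme before language identification results can be compared. The following is a minimal sketch of such a normalization, not something taken from the chapter. The helper name `to_iso_639_3` and both lookup tables are hypothetical, and the tables hold only a small hand-picked sample of codes. A real system would load the full code tables published by the ISO 639-3 registration authority. The sketch also assumes that ISO 639-2 input may use the bibliographic code variants, which for a few languages differ from the ISO 639-3 identifier.

```python
from typing import Optional

# Hypothetical, hand-picked sample of ISO 639-1 -> ISO 639-3 pairs.
# The real code tables are far larger.
ISO_639_1_TO_3 = {
    "en": "eng",  # English
    "fi": "fin",  # Finnish
    "sv": "swe",  # Swedish
    "de": "deu",  # German
    "fr": "fra",  # French
}

# ISO 639-2 bibliographic codes that differ from the ISO 639-3 identifier
# (sample only).
ISO_639_2B_TO_3 = {
    "ger": "deu",
    "fre": "fra",
}


def to_iso_639_3(code: str) -> Optional[str]:
    """Map a two- or three-letter code to ISO 639-3.

    Returns None for codes that are not covered by the sample tables.
    """
    code = code.strip().lower()
    if len(code) == 2:
        return ISO_639_1_TO_3.get(code)
    if len(code) == 3:
        if code in ISO_639_2B_TO_3:
            return ISO_639_2B_TO_3[code]
        if code in ISO_639_1_TO_3.values():
            return code
    return None


if __name__ == "__main__":
    for label in ["fi", "GER", "fin", "xx"]:
        print(label, "->", to_iso_639_3(label))
```

Returning `None` for unknown codes, rather than guessing, keeps unrecognized labels visible to the caller instead of silently merging them with a known language.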


Notes

  1. https://www.ethnologue.com

  2. https://iso639-3.sil.org

  3. https://www.loc.gov/standards/iso639-2

  4. https://glottolog.org

  5. http://www.linguasphere.info

  6. http://raamattu.fi/1992/Luuk.2.html

  7. https://keskustelu.suomi24.fi/t/11135014/jouluevankeliumi-mean-kielela

  8. http://nappablog.blogspot.com/2013/12/jouluevankeliumi-rauman-gialel.html

  9. https://iso639-3.sil.org/code_tables/download_tables

  10. https://depts.washington.edu/uwcl/odin/


Author information

Corresponding author

Correspondence to Tommi Jauhiainen.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Jauhiainen, T., Zampieri, M., Baldwin, T., Lindén, K. (2024). Specific Challenges of Variation and Text Types. In: Automatic Language Identification in Texts. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-45822-4_4

  • DOI: https://doi.org/10.1007/978-3-031-45822-4_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45821-7

  • Online ISBN: 978-3-031-45822-4

  • eBook Packages: Synthesis Collection of Technology (R0)
