Introduction to Language Identification

Part of the book series: Synthesis Lectures on Human Language Technologies (SLHLT)

Abstract

Language identification (LI) is the task of predicting the language(s) in a text or speech input. The main difference between LI of text and LI of speech is that the characters that make up a text are discrete, whereas speech input is usually a continuous signal. Text and speech therefore call for different styles of mathematical methods, traditionally with little methodological overlap between them. In this book, we focus on the language identification of digital text, although we do touch on applications to speech when the speech signal has been converted into a sequence of (discrete) phones. Recognizing the language(s) that a text is written in comes naturally to a human reader familiar with the language(s). Table 1.1 presents excerpts from Wikipedia articles in four different European languages on the topic of Natural Language Processing (NLP), labeled according to the language they are written in. Without referring to the labels, readers of this book will certainly recognize at least one language, and many are likely to identify all of them, even if they cannot read the content in all cases.
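To make the idea of LI over discrete characters concrete, the following is a minimal sketch, not the method described in the book. It assumes a character-trigram overlap score, and the sample sentences in `SAMPLES` as well as the helper names `char_ngrams`, `train`, and `identify` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Tiny illustrative training samples (assumptions, not data from the book).
SAMPLES = {
    "en": "natural language processing is a field of computer science and linguistics",
    "fi": "luonnollisen kielen käsittely on tietojenkäsittelytieteen ja kielitieteen ala",
    "de": "die verarbeitung natürlicher sprache ist ein teilgebiet der informatik und linguistik",
}


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram frequency profile per language."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def identify(text, profiles, n=3):
    """Return the language whose profile contains the most n-grams of the text."""
    grams = char_ngrams(text, n)

    def overlap(profile):
        # Count how many n-grams of the input were seen in this language's sample.
        return sum(1 for gram in grams if gram in profile)

    return max(profiles, key=lambda lang: overlap(profiles[lang]))


if __name__ == "__main__":
    profiles = train(SAMPLES)
    print(identify("language processing", profiles))   # prints "en"
    print(identify("kielen käsittely", profiles))      # prints "fi"
    print(identify("natürlicher sprache", profiles))   # prints "de"
```

With samples this small the sketch only separates inputs that closely resemble the training sentences; a realistic system would need far more data per language and a smoothed probabilistic score rather than a raw overlap count.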


Notes

  1. We were unable to obtain the original article, so our account of the paper is based on the abstract and reports in later published articles.

  2. http://corporavm.uni-koeln.de/vardial/.

  3. http://ttg.uni-saarland.de/lt4vardial2015/.

  4. http://ttg.uni-saarland.de/vardial2016/.

  5. http://ttg.uni-saarland.de/vardial2017/.

  6. http://alt.qcri.org/vardial2018/.

  7. https://sites.google.com/view/vardial2019/campaign.

  8. https://sites.google.com/view/vardial2020/evaluation-campaign.

  9. https://sites.google.com/view/vardial2021/evaluation-campaign.

  10. https://sites.google.com/view/vardial-2022/shared-tasks.

  11. Optical Character Recognition (OCR).

  12. http://urn.fi/urn:nbn:fi:lb-202009152.

  13. https://translate.google.com.

  14. http://urn.fi/urn:nbn:fi:lb-2020022901.

Author information

Corresponding author

Correspondence to Tommi Jauhiainen.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Jauhiainen, T., Zampieri, M., Baldwin, T., Lindén, K. (2024). Introduction to Language Identification. In: Automatic Language Identification in Texts. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-45822-4_1

  • DOI: https://doi.org/10.1007/978-3-031-45822-4_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45821-7

  • Online ISBN: 978-3-031-45822-4

  • eBook Packages: Synthesis Collection of Technology (R0)
