Abstract
Language identification (LI) is the task of predicting the language(s) in a text or speech input. The main difference between LI of text and LI of speech is that the characters that make up a text are discrete, whereas speech input is usually a continuous signal. As a result, different kinds of mathematical methods are needed to process text and speech, and traditionally there has been little methodological overlap between the two. In this book, we focus on the language identification of digital text, although we do touch on applications to speech when the speech signal has been transcribed into a sequence of (discrete) phones. Recognizing the language(s) that a text is written in comes naturally to a human reader familiar with the language(s). Table 1.1 presents excerpts from Wikipedia articles in four different European languages on the topic of Natural Language Processing (NLP), labeled according to the language they are written in. Without referring to the labels, readers of this book will certainly recognize at least one language, and many are likely to identify all of them, even if they cannot read the content in all cases.
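To make the task definition concrete, the following is a minimal sketch of text LI as a function from a string of discrete characters to a language label. It is not the method presented in the book: the character-trigram overlap score, the function names (`char_ngrams`, `train`, `identify`), and the tiny hand-written training sentences in `SAMPLES` are all assumptions chosen purely for illustration.

```python
from collections import Counter

# Toy training text per language (hypothetical, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "fi": "nopea ruskea kettu hyppää laiskan koiran yli ja kissa nukkuu",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def train(samples, n=3):
    """Build one character n-gram profile per language from labeled text."""
    return {lang: char_ngrams(text, n) for lang, text in samples.items()}


def identify(text, profiles, n=3):
    """Return the language whose profile shares the most distinct n-grams with the text."""
    grams = char_ngrams(text, n)
    scores = {
        lang: sum(1 for gram in grams if gram in profile)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)


profiles = train(SAMPLES)
```

Calling `identify("der hund und die katze", profiles)` scores the input against each profile and returns the label with the largest overlap. The sketch only works because text arrives as a sequence of discrete symbols that can be counted directly; a continuous speech signal would first have to be converted into such a sequence.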
Notes
- 1. We were unable to obtain the original article, so our account of the paper is based on the abstract and reports in later published articles.
- 11. Optical Character Recognition (OCR).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Jauhiainen, T., Zampieri, M., Baldwin, T., Lindén, K. (2024). Introduction to Language Identification. In: Automatic Language Identification in Texts. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-45822-4_1
DOI: https://doi.org/10.1007/978-3-031-45822-4_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45821-7
Online ISBN: 978-3-031-45822-4
eBook Packages: Synthesis Collection of Technology (R0)
