Abstract
In general, the more languages an identifier has to distinguish, the harder it is to identify the language of a given text (Brown 2012; Rodrigues 2012; Jauhiainen et al. 2017a). Intuitively, adding classes makes a classification task more difficult. However, the observed effect depends in part on the evaluation measure used. For example, if accuracy is averaged over all languages, the average may improve when easily distinguishable languages are added to the repertoire. Brown (2014) presents results where the average accuracy is higher for 1366 languages than for a subset of 781 languages. He attributes this to the smaller repertoire having a larger proportion of languages whose data come from Wikipedia texts, which are often multilingual and contain a lot of text in languages other than the intended one. Most language identification research has focused on a relatively small number of languages. Table 5.1 lists references that have empirically tested language identifiers with 100 or more languages.
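The dependence on the evaluation measure can be illustrated with a small sketch. It assumes that "average accuracy" means the unweighted mean of per-language accuracies; the language codes and all accuracy values below are hypothetical and are not taken from Brown (2014) or any other cited study.

```python
from statistics import mean


def average_accuracy(per_language_accuracy):
    """Unweighted mean of per-language accuracies (assumed definition)."""
    return mean(list(per_language_accuracy.values()))


# Hypothetical accuracies for three closely related, hard-to-separate languages.
small_repertoire = {"bos": 0.70, "hrv": 0.72, "srp": 0.75}

# Hypothetical larger repertoire: the hard languages get slightly worse,
# but three easily distinguishable languages are added.
large_repertoire = {
    "bos": 0.69, "hrv": 0.71, "srp": 0.74,  # each one point lower than before
    "fin": 0.99, "jpn": 1.00, "ell": 0.99,  # easy additions
}

small_avg = average_accuracy(small_repertoire)  # about 0.72
large_avg = average_accuracy(large_repertoire)  # about 0.85

print(f"average accuracy, 3 languages: {small_avg:.3f}")
print(f"average accuracy, 6 languages: {large_avg:.3f}")
```

In this toy setting every original language is identified slightly less accurately in the larger repertoire, yet the average rises, because the added languages pull the mean upwards.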
References
I. Adebara, A. Elmadany, M. Abdul-Mageed, A. Inciarte, AfroLID: a neural language identification tool for African languages, in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1958–1981, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. https://aclanthology.org/2022.emnlp-main.128
M. Al-Badrashiny, M. Diab, The George Washington University system for the code-switching workshop shared task 2016, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 108–111, Austin, Texas, USA, Nov. 2016a. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5813. https://aclanthology.org/W16-5813
M. Al-Badrashiny, M. Diab, LILI: a simple language independent approach for language identification, in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1211–1219, Osaka, Japan, Dec. 2016b. The COLING 2016 Organizing Committee. https://aclanthology.org/C16-1115
M. Al-Badrashiny, H. Elfardy, M. Diab, AIDA2: a hybrid approach for token and sentence level dialect identification in Arabic, in Proceedings of the 19th Conference on Computational Language Learning, pp. 42–51, Beijing, China (2015)
D. Alfter, Language Segmentation. Master’s thesis, Universität Trier, Trier, Germany (2015)
R. Arun, V. Suresh, C.E. Veni Madhavan, M.N. Narasimha Murthy, On finding the natural number of topics with latent Dirichlet allocation: some observations, in Advances in Knowledge Discovery and Data Mining, ed. by M.J. Zaki, J.X. Yu, B. Ravindran, V. Pudi (Springer, Berlin, Heidelberg, 2010), pp. 391–402. ISBN 978-3-642-13657-3
T. Baldwin, M. Lui, Multilingual language identification: ALTW 2010 shared task data, in Proceedings of the Australasian Language Technology Association Workshop 2010, pp. 4–7, Melbourne, Australia, Dec. 2010a. https://aclanthology.org/U10-1003
G. Bernier-Colborne, C. Goutte, Challenges in neural language identification: NRC at VarDial 2020, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 273–282, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics (ICCL). https://www.aclweb.org/anthology/2020.vardial-1.26
G. Bernier-Colborne, C. Goutte, S. Léger, Improving cuneiform language identification with BERT, in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 17–25, Ann Arbor, Michigan, June 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-1402. https://www.aclweb.org/anthology/W19-1402
G. Bernier-Colborne, S. Léger, C. Goutte, N-gram and neural models for Uralic language identification: NRC at VarDial 2021, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 128–134, Kyiv, Ukraine, Apr. 2021. Association for Computational Linguistics. https://www.aclweb.org/anthology/2021.vardial-1.15
C. Biemann, S. Teresniak, Disentangling from Babylonian confusion—unsupervised language identification, in Computational Linguistics and Intelligent Text Processing: 6th International Conference, ed. by A. Gelbukh. CICLing 2005 (Springer, Mexico City, Mexico, 2005), pp. 773–784
V. Bobicev, Discriminating between similar languages using PPM, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 59–65, Hissar, Bulgaria, Sept. 2015. Association for Computational Linguistics. https://aclanthology.org/W15-5410
R. Brown, Non-linear mapping for improved identification of 1300+ languages, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 627–632, Doha, Qatar, Oct. 2014. Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1069. https://aclanthology.org/D14-1069
R.D. Brown, Selecting and weighting N-grams to identify 1100 languages, in Proceedings of the 16th International Conference on Text, Speech and Dialogue (TSD 2013), pp. 475–483, Plzeň, Czech Republic (2013)
R.D. Brown, Finding and identifying text in 900+ languages. Digit. Invest. 9, S34–S43 (2012)
Ç. Çöltekin, Dialect identification under domain shift: Experiments with discriminating Romanian and Moldavian, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 186–192, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics (ICCL). https://www.aclweb.org/anthology/2020.vardial-1.17
I. Caswell, T. Breiner, D. van Esch, A. Bapna, Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus, in Proceedings of the 28th International Conference on Computational Linguistics, pp. 6588–6608, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.579. https://www.aclweb.org/anthology/2020.coling-main.579
W.B. Cavnar, J.M. Trenkle, N-gram-based text categorization, in Proceedings of SDAIR-94, Third Annual Symposium on Document Analysis and Information Retrieval, pp. 161–175, Las Vegas, USA (1994)
J. Cazamias, C. Dixit, M. Marek, Large-Scale Language Classification—Writing a Detector for 200 Languages on Twitter. Stanford course report (2015)
A. Ceolin, Comparing the performance of CNNs and shallow models for language identification, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 102–112, Kyiv, Ukraine, Apr. 2021. Association for Computational Linguistics. https://www.aclweb.org/anthology/2021.vardial-1.12
B.R. Chakravarthi, M. Gaman, R.T. Ionescu, H. Jauhiainen, T. Jauhiainen, K. Lindén, N. Ljubešić, N. Partanen, R. Priyadharshini, C. Purschke, E. Rajagopal, Y. Scherrer, M. Zampieri, Findings of the VarDial evaluation campaign 2021, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 1–11, Kyiv, Ukraine, Apr. 2021. Association for Computational Linguistics. https://www.aclweb.org/anthology/2021.vardial-1.1
Y.C. Chew, Y. Mikami, R.L. Nagano, Language identification of web pages based on improved N-gram algorithm. Int. J. Comput. Sci. Issues 8(3), 47–58 (2011)
S. Clematide, P. Makarov, CLUZH at VarDial GDI 2017: testing a variety of machine learning tools for the classification of Swiss German dialects, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 170–177, Valencia, Spain, Apr. 2017. Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-1221. https://aclanthology.org/W17-1221
J. Cowie, Y. Ludovik, R. Zacharski, Language recognition for mono- and multi-lingual documents, in Proceedings of the VexTal Conference, pp. 209–214, Venice, Italy (1999)
N. Dongen, Analysis and Prediction of Dutch-English Code-switching in Dutch Social Media Messages. Master’s thesis, Universiteit van Amsterdam, Amsterdam, Netherlands (2017)
S. Dowlagar, R. Mamidi, A pre-trained transformer and CNN model with joint language ID and part-of-speech tagging for code-mixed social-media text, in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp. 367–374, Held Online, Sept. 2021. INCOMA Ltd. https://aclanthology.org/2021.ranlp-main.42
A. Dutta, Word-level language identification using subword embeddings for code-mixed Bangla-English social media data, in Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages within the 13th Language Resources and Evaluation Conference, pp. 76–82, Marseille, France, June 2022. European Language Resources Association. https://aclanthology.org/2022.dclrl-1.10
S. Dutta, T. Saha, S. Banerjee, S.K. Naskar, Text normalization in code-mixed social media text, in 2nd International Conference on Recent Trends in Information Systems (ReTIS), pp. 378–382, Kolkata, India (2015)
R. Eskander, M. Al-Badrashiny, N. Habash, O. Rambow, Foreign words and the automatic processing of Arabic social media text written in Roman script, in Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 1–12, Doha, Qatar, 2014. Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-3901. https://aclanthology.org/W14-3901
M. Gaman, D. Hovy, R.T. Ionescu, H. Jauhiainen, T. Jauhiainen, K. Lindén, N. Ljubešić, N. Partanen, C. Purschke, Y. Scherrer, M. Zampieri, A report on the VarDial evaluation campaign 2020, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 1–14, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics (ICCL). https://www.aclweb.org/anthology/2020.vardial-1.1
R.R.R. Gangula, R. Mamidi, Addition of Code Mixed Features to Enhance the Sentiment Prediction of Song Lyrics (2018). http://arxiv.org/abs/1806.03821
S. Ghosh, S. Ghosh, D. Das, Labeling of query words using conditional random field, in Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2015), pp. 31–34, Gandhinagar, India (2015)
O. Giwa, M.H. Davel, Language identification of individual words with joint sequence models, in Proceedings of Interspeech 2014, Singapore (2014)
O. Giwa, M.H. Davel, N-gram based language identification of individual words, in Proceedings of the 24th Annual Symposium of the Pattern Recognition Association of South Africa, pp. 15–22, Johannesburg, South Africa, ed. by P. Robinson (2013)
S. Gundapu, R. Mamidi, Word level language identification in English Telugu code mixed data, in Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation, Hong Kong, 1–3 Dec. 2018. Association for Computational Linguistics. https://www.aclweb.org/anthology/Y18-1021
B. Hughes, T. Baldwin, S. Bird, J. Nicholson, A. MacKinlay, Reconsidering language identification for written language resources, in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, May 2006. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2006/pdf/459_pdf.pdf
A. Hussain, M.U. Arshad, An attention based neural network for code switching detection: English & Roman Urdu (2021). arXiv:2103.02252
D. Jain, DA-IICT in FIRE 2015 shared task on mixed script information retrieval, in Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2015), pp. 53–56, Gandhinagar, India (2015)
T. Jauhiainen, Tekstin kielen automaattinen tunnistaminen. Master’s thesis, University of Helsinki, Helsinki (2010)
T. Jauhiainen, H. Jauhiainen, K. Lindén, Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models, in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 178–187, Ann Arbor, Michigan, June 2019c. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-1419. https://www.aclweb.org/anthology/W19-1419
T. Jauhiainen, H. Jauhiainen, K. Lindén, Iterative language model adaptation for Indo-Aryan language identification, in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 66–75, Santa Fe, New Mexico, USA, Aug. 2018a. Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-3907
T. Jauhiainen, H. Jauhiainen, K. Lindén, HeLI-based experiments in Swiss German dialect identification, in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 254–262, Santa Fe, New Mexico, USA, Aug. 2018b. Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-3929
T. Jauhiainen, H. Jauhiainen, K. Lindén, Naive Bayes-based experiments in Romanian dialect identification, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 76–83, Kyiv, Ukraine, Apr. 2021a. Association for Computational Linguistics. https://www.aclweb.org/anthology/2021.vardial-1.9
H. Jauhiainen, T. Jauhiainen, K. Lindén, Wanca in Korp: text corpora for under resourced Uralic languages, in Proceedings of the Research Data and Humanities (RDHUM) 2019 Conference, number 17 in Studia Humaniora Ouluensia, ed. by J. Jantunen, S. Brunni, N. Kunnas, S. Palviainen, K. Västi, pp. 21–40, Finland, 2019a. University of Oulu. ISBN 978-952-62-2320-9
T. Jauhiainen, H. Jauhiainen, N. Partanen, K. Lindén, Uralic language identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 173–185, Barcelona, Spain (Online), Dec. 2020c. International Committee on Computational Linguistics (ICCL). https://www.aclweb.org/anthology/2020.vardial-1.16
T. Jauhiainen, K. Lindén, H. Jauhiainen, Evaluation of language identification methods using 285 languages, in Proceedings of the 21st Nordic Conference on Computational Linguistics, pp. 183–191, Gothenburg, Sweden, May 2017a. Association for Computational Linguistics. https://www.aclweb.org/anthology/W17-0221
T. Jauhiainen, K. Lindén, H. Jauhiainen, Language Set Identification in Noisy Synthetic Multilingual Documents, in Proceedings of the Computational Linguistics and Intelligent Text Processing 16th International Conference (CICLing 2015), pp. 633–643, Cairo, Egypt, 2015c
T. Jauhiainen, K. Lindén, H. Jauhiainen, Language model adaptation for language and dialect identification of text. Nat. Lang. Eng. 25(5), 561–583 (2019)
H. Jhamtani, S.K. Bhogi, V. Raychoudhury, Word-level language identification in bi-lingual code-switched texts, in 28th Pacific Asia Conference on Language, Information and Computation, pp. 348–357, Phuket, Thailand (2014)
L. Kevers, CoSwID, a code switching identification method suitable for under-resourced languages, in Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, pp. 112–121, Marseille, France, June 2022. European Language Resources Association. https://aclanthology.org/2022.sigul-1.15
B. King, S. Abney, Labeling the languages of words in mixed-language documents using weakly supervised methods, in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1110–1119, Atlanta, Georgia, June 2013. Association for Computational Linguistics. https://aclanthology.org/N13-1131
J. King, J. Dehdari, An N-gram Based Language Identification System. The Ohio State University (2008)
L. King, S. Kübler, W. Hooper, Word-level language identification in The Chymistry of Isaac Newton. Digit. Sch. Hum. 30(4), 532–540 (2015)
T. Kocmi, O. Bojar, LanideNN: multilingual language identification on character window, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 927–936, Valencia, Spain, Apr. 2017. Association for Computational Linguistics. https://aclanthology.org/E17-1087
S.B. Kotsiantis, Supervised machine learning: a review of classification techniques. Informatica 31, 249–268 (2007)
S.S.V. Kusampudi, A. Chaluvadi, R. Mamidi, Corpus creation and language identification in low-resource code-mixed Telugu-English text, in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp. 744–752, Held Online, Sept. 2021. INCOMA Ltd. https://aclanthology.org/2021.ranlp-main.85
J.D. Lafferty, A. McCallum, F.C.N. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, in Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, San Francisco, CA, USA (Morgan Kaufmann Publishers Inc., 2001), pp. 282–289. ISBN 1-55860-778-1. http://dl.acm.org/citation.cfm?id=645530.655813
B.S. Lakshmi, B. Shambhavi, An Automatic Language Identification system for code-mixed English-Kannada Social Media Text, in 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS), pp. 1–5 (IEEE, 2017)
P. Lamabam, K. Chakma, A language identification system for code-mixed English-Manipuri social media text, in Proceedings of the IEEE International Conference on Engineering and Technology (ICETECH 2016), pp. 79–83, Coimbatore, TN, India (2016)
Y. Li, T. Baldwin, T. Cohn, What’s in a Domain? Learning domain-robust text representations using adversarial training, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 474–479, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2076. https://aclanthology.org/N18-2076
C.-C. Lin, W. Ammar, L. Levin, C. Dyer, The CMU submission for the shared task on language identification in code-switched data, in Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 80–86, Doha, Qatar, Oct. 2014. Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-3909. https://aclanthology.org/W14-3909
W. Ling, G. Xiang, C. Dyer, A. Black, I. Trancoso, Microblogs as parallel corpora, in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 176–186, Sofia, Bulgaria, Aug. 2013. Association for Computational Linguistics. https://aclanthology.org/P13-1018
N. Ljubešić, F. Klubička, bs, hr, srWaC—web corpora of Bosnian, Croatian and Serbian, in Proceedings of the 9th Web as Corpus Workshop (WaC-9), pp. 29–35, Gothenburg, Sweden, Apr. 2014. Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-0405. https://aclanthology.org/W14-0405
Y. Ludovik, R. Zacharski, Multilingual Document Language Recognition for Creating Corpora. Technical report, New Mexico State University (1999)
M. Lui, Generalized Language Identification. Ph.D. thesis, The University of Melbourne (2014)
M. Lui, T. Baldwin, Cross-domain feature selection for language identification, in Proceedings of 5th International Joint Conference on Natural Language Processing, pp. 553–561, Chiang Mai, Thailand, Nov. 2011. Asian Federation of Natural Language Processing. https://aclanthology.org/I11-1062
M. Lui, J.H. Lau, T. Baldwin, Automatic detection and language identification of multilingual documents. Trans. Assoc. Comput. Linguist. 2, 27–40 (2014). https://doi.org/10.1162/tacl_a_00163. https://aclanthology.org/Q14-1003
M. Lundén, O.L. Schalberg, Dissertatio critico-theologica, de vera indole partium poenitentiae, quam, cons. max. ven. fac. theol. ad Reg. Acad. Aboëns. auctor & praeses mag. Olavus Schalberg, metaphys. & log. prof. reg. & ord. nec non respondens Michaël Lundén, sac. minist. adj. & curam gerens, ad St. Cathar. Publicae bonorum disquisitioni submittunt, die [ ] Novembr. an. MDCCLXXXV, in auditorio majori, h. a. m. c. Ph.D. thesis, Aboae (1785). http://urn.fi/URN:NBN:fi-fd2014-00005712
M. Mager, Ö. Çetinoğlu, K. Kann, Subword-Level language identification for intra-word code-switching, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 2005–2011, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1201. https://www.aclweb.org/anthology/N19-1201
M. Majliš, Large Multilingual Corpus. Master’s thesis, Charles University in Prague, Prague (2011)
M. Majliš, Yet another language identifier, in Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 46–54, Avignon, France, Apr. 2012. Association for Computational Linguistics. https://aclanthology.org/E12-3006
S. Malmasi, Open-Set Language Identification (2017). arXiv:1707.04817
S. Malmasi, M. Dras, Language identification using classifier ensembles, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 35–43, Hissar, Bulgaria, Sept. 2015b. Association for Computational Linguistics. https://aclanthology.org/W15-5407
S. Mandal, A.K. Singh, Language identification in code-mixed data using multichannel neural networks and context capture, in Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pp. 116–120, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-6116. https://aclanthology.org/W18-6116
S. Mandal, S. Banerjee, S.K. Naskar, P. Rosso, S. Bandyopadhyay, Adaptive voting in multiple classifier systems for word level language identification, in Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2015), pp. 49–52, Gandhinagar, India (2015)
T. Mandl, M. Shramko, O. Tartakovski, C. Womser-Hacker, Language identification in multi-lingual web-documents, in Proceedings of the 11th International Conference on Applications of Natural Language to Information Systems (NLDB 2006), pp. 153–163, Klagenfurt, Austria (2006)
L.A. Mather, A linear algebra approach to language identification, in Proceedings of the 4th International Workshop Principles of Digital Document Processing (PODDP’98), pp. 92–103, Saint Malo, France (1998)
D. Mave, S. Maharjan, T. Solorio, Language identification and analysis of code-switched social media text, in Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pp. 51–61. Association for Computational Linguistics (2018). http://aclweb.org/anthology/W18-3206
U.F. Mayer, Bootstrapped language identification for multi-site internet domains, in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 579–585, Beijing, China (2012)
A. Minocha, F.M. Tyers, Subsegmental language detection in Celtic language text, in Proceedings of the First Celtic Language Technology Workshop (CLTW 2014), pp. 76–80, Dublin, Ireland (2014)
A. Mishra, Y. Sharma, Language identification and context-based analysis of code-switching behaviors in social media discussions, in 2019 IEEE International Conference on Big Data (Big Data), pp. 5951–5956 (2019). https://doi.org/10.1109/BigData47090.2019.9006032
G. Mohr, M. Stack, I. Ranitovic, D. Avery, M. Kimpton, Introduction to Heritrix, in 4th International Web Archiving Workshop, Bath, UK (2004)
K.N. Murthy, G.B. Kumar, Language identification from small text samples. J. Quant. Linguist. 13(1), 57–80 (2006)
I. Ndubuisi-Obi, S. Ghosh, D. Jurgens, Wetin dey with these comments? Modeling sociolinguistic factors affecting code-switching behavior in Nigerian online discussions, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6204–6214, Florence, Italy, July 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1625. https://aclanthology.org/P19-1625
K. Nelakuditi, D.S. Jitta, R. Mamidi, Part-of-Speech tagging for code mixed English-Telugu Social Media Data, in International Conference on Intelligent Text Processing and Computational Linguistics, pp. 332–342, Springer (2016)
H. Ney, U. Essen, R. Kneser, On structuring probabilistic dependences in stochastic language modelling. Comput. Speech Lang. 8(1), 1–38 (1994)
L. Nguyen, C. Bryant, S. Kidwai, T. Biberauer, Automatic language identification in code-switched Hindi-English social media text. J. Open Hum. Data 7 (2021)
D. Nguyen, A.S. Doğruöz, Word level language identification in online multilingual communication, in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 857–862, Seattle, Washington, USA, Oct. 2013. Association for Computational Linguistics. https://aclanthology.org/D13-1084
G. Ozbek, I. Rosenn, E. Yeh, Language Classification in Multilingual Documents. Technical report, Stanford University (2006)
D. Pelleg, A. Moore, X-means: extending k-means with efficient estimation of the number of clusters, in Proceedings of the 17th International Conference on Machine Learning, vol. 1, pp. 727–734 (2000)
F. Peng, F. Feng, A. McCallum, Chinese segmentation and new word detection using conditional random fields, in COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pp. 562–568, Geneva, Switzerland, Aug. 23–Aug. 27 2004. COLING. https://aclanthology.org/C04-1081
G. Pethö, E. Mózes, An N-gram-based language identification algorithm for variable-length and variable-language texts. Argumentum 10, 56–82 (2014)
A. Phadte, G. Thakkar, Towards normalising Konkani-English code-mixed social media text, in Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017), pp. 85–94 (2017)
A. Phadte, R. Wagh, Word level language identification system for Konkani-English code-mixed social media text (CMST), in Proceedings of the 10th Annual ACM India Compute Conference, pp. 103–107 (2017)
F. Pla, L.-F. Hurtado, Language identification in twitter: a study case of multiclass and multilabel text classification problem. Int. J. Comput. Linguist. Appl. 6(1), 135–150 (2015)
F. Pla, L.-F. Hurtado, Language identification of multilingual posts from twitter: a case study. Knowl. Inf. Syst. 51(3), 965–989 (2017)
A. Poulston, Z. Waseem, M. Stevenson, Using TF-IDF n-gram and word embedding cluster ensembles for author profiling: notebook for PAN at CLEF 2017, in Working Notes Papers of CLEF 2017 Evaluation Labs and Workshop, Dublin, Ireland, September 2017, ed. by L. Cappellato, N. Ferro, L. Goeuriot, T. Mandl. CEUR-WS.org. http://ceur-ws.org/Vol-1866/
J.M. Prager, Linguini: language identification for multilingual documents, in Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences (HICSS-32), Maui, USA (1999)
V. Ramanarayanan, R. Pugh, Automatic token and turn level language identification for code-switched text dialog: an analysis across language pairs and corpora, in Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pp. 80–88, Melbourne, Australia, July 2018. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5009. https://aclanthology.org/W18-5009
P. Rodrigues, Processing Highly Variant Language Using Incremental Model Selection. Ph.D. thesis, Indiana University (2012)
Y. Samih, Dialectal Arabic Processing Using Deep Learning. Ph.D. thesis, Heinrich-Heine-Universität Düsseldorf, Düsseldorf, Germany (2017)
Y. Samih, S. Maharjan, M. Attia, L. Kallmeyer, T. Solorio, Multilingual code-switching identification via LSTM recurrent neural networks, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 50–59, Austin, Texas, Nov. 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5806. https://aclanthology.org/W16-5806
Y. Samih, W. Maier, Detecting code-switching in Moroccan Arabic social media, in Proceedings of the 4th International Workshop on Natural Language Processing for Social Media (SocialNLP 2016 IJCAI), New York City, USA (2016)
S. Schulz, M. Keller, Code-switching ubique est—language identification and part-of-speech tagging for historical mixed text, in Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 43–51, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-2105. https://aclanthology.org/W16-2105
P. Shrestha, Codeswitching detection via lexical features in conditional random fields, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 121–126, Austin, Texas, Nov. 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5816. https://aclanthology.org/W16-5816
U.K. Sikdar, B. Gambäck, Language identification in code-switched text using conditional random fields and BabelNet, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 127–131, Austin, Texas, Nov. 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5817. https://aclanthology.org/W16-5817
T. Solorio, E. Blair, S. Maharjan, S. Bethard, M. Diab, M. Ghoneim, A. Hawwari, F. AlGhamdi, J. Hirschberg, A. Chang, P. Fung, Overview for the first shared task on language identification in code-switched data, in Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 62–72, Doha, Qatar, Oct. 2014. http://www.aclweb.org/anthology/W14-3907
N.B. Sristy, N.S. Krishna, B.S. Krishna, V. Ravi, Language identification in mixed script, in Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation, pp. 14–20 (2017)
M. Stupar, T. Jurić, N. Ljubešić, Language identification of web data for Building Linguistic Corpora, in Proceedings of the 3rd International Conference on The Future of Information Sciences (INFuture 2011), pp. 365–372, Zagreb, Croatia (2011)
I. Suzuki, Y. Mikami, A. Ohsato, Y. Chubachi, A language and character set determination method based on N-gram statistics. ACM Trans. Asian Lang. Inf. Proc. (TALIP) 1(3), 269–278 (2002)
Y. Teh, M. Jordan, M. Beal, D. Blei, Sharing clusters among related groups: hierarchical Dirichlet processes. Advances in Neural Information Processing Systems, vol. 17 (2004)
G.B. Tran, D.B. Nguyen, B.T. Kieu, N-gram based approach for multilingual language identification. Poster. Technical report, Australasian Language Technology Association Workshop (2010). http://comp.mq.edu.au/programming/task_description/VILangTek.pdf
E. Ullman, Shibboleth—A Multilingual Language Identifier. Master’s thesis, Uppsala University, Uppsala (2014)
T. Vatanen, J.J. Väyrynen, S. Virpioja, Language identification of short text segments with N-gram models, in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, May 2010. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2010/pdf/279_Paper.pdf
J. Vogel, D. Tresner-Kirsch, Robust language identification in short, noisy texts: improvements to LIGA, in Proceedings of the 3rd International Workshop on Mining Ubiquitous and Social Environments (MUSE), ed. by M. Atzmueller, H. Andreas, pp. 43–50, Bristol, UK (2012)
M. Volk, L. Fischer, P. Scheurer, B.S. Schroffenegger, R. Schwitter, P. Ströbel, B. Suter, Nunc profana tractemus. Detecting code-switching in a large corpus of 16th century letters, in Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 2901–2908, Marseille, France, June 2022. European Language Resources Association. https://aclanthology.org/2022.lrec-1.311
C. Voss, S. Tratz, J. Laoudi, D. Briesch, Finding romanized Arabic dialect in code-mixed tweets, in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 2249–2253, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2014/pdf/1116_Paper.pdf
A. Wan, Leveraging data-driven methods in word-level language identification for a Multilingual Alpine Heritage Corpus, in Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP, pp. 45–54, San Diego, CA, June 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-1206. https://aclanthology.org/W16-1206
N. Wu, E. DeMattos, K. H. So, P.-Z. Chen, Ç. Çöltekin, Language discrimination and transfer learning for similar languages: Experiments with feature combinations and adaptation, in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 54–63, Ann Arbor, Michigan, June 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-1406. https://www.aclweb.org/anthology/W19-1406
M.X. Xia, Codeswitching language identification using subword information enriched word vectors, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 132–136, Austin, Texas, Nov. 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5818. https://aclanthology.org/W16-5818
M.X. Xia, J.C.K. Cheung, Accurate Pinyin-English codeswitched language identification, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 71–79, Austin, Texas, Nov. 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5809. https://aclanthology.org/W16-5809
F. Xia, W. Lewis, H. Poon, Language ID in the context of harvesting language data off the web, in Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pp. 870–878, Athens, Greece, Mar. 2009. Association for Computational Linguistics. https://aclanthology.org/E09-1099
H. Yamaguchi, K. Tanaka-Ishii, Text segmentation by language using minimum description length, in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 969–978, Jeju Island, Korea, July 2012. Association for Computational Linguistics. https://aclanthology.org/P12-1102
Z. Yirmibeşoğlu, G. Eryiğit, Detecting code-switching between Turkish-English language pair, in Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pp. 110–115, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-6115. https://aclanthology.org/W18-6115
J. Younes, H. Achour, E. Souissi, A. Ferchichi, A deep learning approach for the Romanized Tunisian Dialect identification. Int. Arab J. Inf. Technol. (IAJIT) 17(6), 935–946 (2020)
M. Zampieri, B.G. Gebre, H. Costa, J. van Genabith, Comparing approaches to the identification of similar languages, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 66–72, Hissar, Bulgaria, Sept. 2015a. Association for Computational Linguistics. https://aclanthology.org/W15-5411
M. Zampieri, S. Malmasi, P. Nakov, A. Ali, S. Shon, J. Glass, Y. Scherrer, T. Samardžić, N. Ljubešić, J. Tiedemann, C. van der Lee, S. Grondelaers, N. Oostdijk, D. Speelman, A. van den Bosch, R. Kumar, B. Lahiri, M. Jain, Language identification and morphosyntactic tagging: the second VarDial evaluation campaign, in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 1–17, Santa Fe, New Mexico, USA, Aug. 2018a. Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-3901
M. Zampieri, L. Tan, N. Ljubešić, J. Tiedemann, P. Nakov, Overview of the DSL shared task 2015, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 1–9, Hissar, Bulgaria, Sept. 2015b. Association for Computational Linguistics. https://www.aclweb.org/anthology/W15-5401
W. Zhang, R.A.J. Clark, Y. Wang, W. Li, Unsupervised language identification based on latent Dirichlet allocation. Comput. Speech Lang. 39, 47–66 (2016)
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Jauhiainen, T., Zampieri, M., Baldwin, T., Lindén, K. (2024). Large Scale, Multi-domain Language Identification. In: Automatic Language Identification in Texts. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-45822-4_5
DOI: https://doi.org/10.1007/978-3-031-45822-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45821-7
Online ISBN: 978-3-031-45822-4
eBook Packages: Synthesis Collection of Technology (R0)
