Large Scale, Multi-domain Language Identification

Part of the book series: Synthesis Lectures on Human Language Technologies (SLHLT)

Abstract

In general, the more languages a language identifier must distinguish, the harder the identification task becomes (Brown 2012; Rodrigues 2012; Jauhiainen et al. 2017a). Intuitively, adding classes makes classification more difficult. However, the observed effect depends in part on the evaluation measure used. For example, accuracy averaged over all languages may improve when easily distinguishable languages are added to the repertoire. Brown (2014) reports a higher average accuracy for 1366 languages than for a subset of 781 languages. He attributes this to the larger proportion of languages in the smaller set whose data are based on Wikipedia texts, which are often multilingual and contain large amounts of text in unintended languages. Most language identification research has focused on a relatively small number of languages. Table 5.1 lists references that have empirically tested language identifiers with 100 or more languages.
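The effect of the evaluation measure can be illustrated with a minimal sketch. All language labels and accuracy figures below are hypothetical and are not taken from Brown (2014) or any other cited study; the function name `average_accuracy` is likewise an illustrative choice. The sketch shows that an unweighted average over languages can rise after easy languages are added, even though every original language is identified slightly less accurately.

```python
from statistics import mean


def average_accuracy(per_language_accuracy):
    """Return the unweighted mean of per-language accuracies."""
    return mean(per_language_accuracy.values())


# Hypothetical accuracies for a small repertoire of closely related languages.
small = {"bs": 0.70, "hr": 0.75, "sr": 0.80}

# Hypothetical accuracies after two easily distinguishable languages are added.
# Each original language is now identified slightly less accurately.
large = {"bs": 0.68, "hr": 0.74, "sr": 0.78, "fi": 0.99, "ja": 0.99}

small_avg = average_accuracy(small)
large_avg = average_accuracy(large)

print(f"average over {len(small)} languages: {small_avg:.3f}")
print(f"average over {len(large)} languages: {large_avg:.3f}")
```

Under these assumed figures the average increases with the larger repertoire, which is why a per-language breakdown is needed before concluding that a larger language set made the task easier.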

Notes

  1. https://taku910.github.io/crfpp/

  2. https://pypi.org/project/python-crfsuite/

References

  • I. Adebara, A. Elmadany, M. Abdul-Mageed, A. Inciarte, AfroLID: a neural language identification tool for African languages, in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1958–1981, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. https://aclanthology.org/2022.emnlp-main.128

  • M. Al-Badrashiny, M. Diab, The George Washington University system for the code-switching workshop shared task 2016, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 108–111, Austin, Texas, USA, Nov. 2016a. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5813. https://aclanthology.org/W16-5813

  • M. Al-Badrashiny, M. Diab, LILI: a simple language independent approach for language identification, in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1211–1219, Osaka, Japan, Dec. 2016b. The COLING 2016 Organizing Committee. https://aclanthology.org/C16-1115

  • M. Al-Badrashiny, H. Elfardy, M. Diab, AIDA2: a hybrid approach for token and sentence level dialect identification in Arabic, in Proceedings of the 19th Conference on Computational Language Learning, pp. 42–51, Beijing, China (2015)

  • D. Alfter, Language Segmentation. Master’s thesis, Universität Trier, Trier, Germany (2015)

  • R. Arun, V. Suresh, C.E. Veni Madhavan, M.N. Narasimha Murthy, On finding the natural number of topics with latent Dirichlet allocation: some observations, in Advances in Knowledge Discovery and Data Mining, ed. by M.J. Zaki, J.X. Yu, B. Ravindran, V. Pudi (Springer, Berlin, Heidelberg), pp. 391–402. ISBN 978-3-642-13657-3

  • T. Baldwin, M. Lui, Multilingual language identification: ALTW 2010 shared task data, in Proceedings of the Australasian Language Technology Association Workshop 2010, pp. 4–7, Melbourne, Australia, Dec. 2010a. https://aclanthology.org/U10-1003

  • G. Bernier-Colborne, C. Goutte, Challenges in neural language identification: NRC at VarDial 2020, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 273–282, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics (ICCL). https://www.aclweb.org/anthology/2020.vardial-1.26

  • G. Bernier-Colborne, C. Goutte, S. Léger, Improving cuneiform language identification with BERT, in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 17–25, Ann Arbor, Michigan, June 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-1402. https://www.aclweb.org/anthology/W19-1402

  • G. Bernier-Colborne, S. Léger, C. Goutte, N-gram and neural models for Uralic language identification: NRC at VarDial 2021, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 128–134, Kyiv, Ukraine, Apr. 2021. Association for Computational Linguistics. https://www.aclweb.org/anthology/2021.vardial-1.15

  • C. Biemann, S. Teresniak, Disentangling from Babylonian confusion—unsupervised language identification, in Computational Linguistics and Intelligent Text Processing: 6th International Conference, ed. by A. Gelbukh. CICLing 2005 (Springer, Mexico City, Mexico, 2005), pp. 773–784

  • V. Bobicev, Discriminating between similar languages using PPM, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 59–65, Hissar, Bulgaria, Sept. 2015. Association for Computational Linguistics. https://aclanthology.org/W15-5410

  • R. Brown, Non-linear mapping for improved identification of 1300+ languages, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 627–632, Doha, Qatar, Oct. 2014. Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1069. https://aclanthology.org/D14-1069

  • R.D. Brown, Selecting and weighting N-grams to identify 1100 languages, in Proceedings of the 16th International Conference on Text, Speech and Dialogue (TSD 2013), pp. 475–483, Plzeň, Czech Republic (2013)

  • R.D. Brown, Finding and identifying text in 900+ languages. Digit. Invest. 9, S34–S43 (2012)

  • Ç. Çöltekin, Dialect identification under domain shift: Experiments with discriminating Romanian and Moldavian, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 186–192, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics (ICCL). https://www.aclweb.org/anthology/2020.vardial-1.17

  • I. Caswell, T. Breiner, D. van Esch, A. Bapna, Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus, in Proceedings of the 28th International Conference on Computational Linguistics, pp. 6588–6608, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.579. https://www.aclweb.org/anthology/2020.coling-main.579

  • W.B. Cavnar, J.M. Trenkle, N-gram-based text categorization, in Proceedings of SDAIR-94, Third Annual Symposium on Document Analysis and Information Retrieval, pp. 161–175, Las Vegas, USA (1994)

  • J. Cazamias, C. Dixit, M. Marek, Large-Scale Language Classification—Writing a Detector for 200 Languages on Twitter. Stanford course report (2015)

  • A. Ceolin, Comparing the performance of CNNs and shallow models for language identification, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 102–112, Kyiv, Ukraine, Apr. 2021. Association for Computational Linguistics. https://www.aclweb.org/anthology/2021.vardial-1.12

  • B.R. Chakravarthi, M. Gaman, R.T. Ionescu, H. Jauhiainen, T. Jauhiainen, K. Lindén, N. Ljubešić, N. Partanen, R. Priyadharshini, C. Purschke, E. Rajagopal, Y. Scherrer, M. Zampieri, Findings of the VarDial evaluation campaign 2021, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 1–11, Kyiv, Ukraine, Apr. 2021. Association for Computational Linguistics. https://www.aclweb.org/anthology/2021.vardial-1.1

  • Y.C. Chew, Y. Mikami, R.L. Nagano, Language identification of web pages based on improved N-gram algorithm. Int. J. Comput. Sci. Issues 8(3), 47–58 (2011)

  • S. Clematide, P. Makarov, CLUZH at VarDial GDI 2017: testing a variety of machine learning tools for the classification of Swiss German dialects, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 170–177, Valencia, Spain, Apr. 2017. Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-1221. https://aclanthology.org/W17-1221

  • J. Cowie, Y. Ludovik, R. Zacharski, Language recognition for mono- and multi-lingual documents, in Proceedings of the VexTal Conference, pp. 209–214, Venice, Italy (1999)

  • N. Dongen, Analysis and Prediction of Dutch-English Code-switching in Dutch Social Media Messages. Master’s thesis, Universiteit van Amsterdam, Amsterdam, Netherlands (2017)

  • S. Dowlagar, R. Mamidi, A pre-trained transformer and CNN model with joint language ID and part-of-speech tagging for code-mixed social-media text, in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp. 367–374, Held Online, Sept. 2021. INCOMA Ltd. https://aclanthology.org/2021.ranlp-main.42

  • A. Dutta, Word-level language identification using subword embeddings for code-mixed Bangla-English social media data, in Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages within the 13th Language Resources and Evaluation Conference, pp. 76–82, Marseille, France, June 2022. European Language Resources Association. https://aclanthology.org/2022.dclrl-1.10

  • S. Dutta, T. Saha, S. Banerjee, S.K. Naskar, Text normalization in code-mixed social media text, in 2nd International Conference on Recent Trends in Information Systems (ReTIS), pp. 378–382, Kolkata, India (2015)

  • R. Eskander, M. Al-Badrashiny, N. Habash, O. Rambow, Foreign words and the automatic processing of Arabic social media text written in Roman script, in Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 1–12, Doha, Qatar, 2014. Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-3901. https://aclanthology.org/W14-3901

  • M. Gaman, D. Hovy, R.T. Ionescu, H. Jauhiainen, T. Jauhiainen, K. Lindén, N. Ljubešić, N. Partanen, C. Purschke, Y. Scherrer, M. Zampieri, A report on the VarDial evaluation campaign 2020, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 1–14, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics (ICCL). https://www.aclweb.org/anthology/2020.vardial-1.1

  • R.R.R. Gangula, R. Mamidi, Addition of Code Mixed Features to Enhance the Sentiment Prediction of Song Lyrics (2018). http://arxiv.org/abs/1806.03821

  • S. Ghosh, S. Ghosh, D. Das, Labeling of query words using conditional random field, in Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2015), pp. 31–34, Gandhinagar, India (2015)

  • O. Giwa, M.H. Davel, Language identification of individual words with joint sequence models, in Proceedings of Interspeech 2014, Singapore (2014)

  • O. Giwa, M.H. Davel, N-gram based language identification of individual words, in Proceedings of the 24th Annual Symposium of the Pattern Recognition Association of South Africa, pp. 15–22, Johannesburg, South Africa, ed. by P. Robinson (2013)

  • S. Gundapu, R. Mamidi, Word level language identification in English Telugu code mixed data, in Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation, Hong Kong, 1–3 Dec. 2018. Association for Computational Linguistics. https://www.aclweb.org/anthology/Y18-1021

  • B. Hughes, T. Baldwin, S. Bird, J. Nicholson, A. MacKinlay, Reconsidering language identification for written language resources, in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, May 2006. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2006/pdf/459_pdf.pdf

  • A. Hussain, M.U. Arshad, An attention based neural network for code switching detection: English & Roman Urdu (2021). arXiv:2103.02252

  • D. Jain, DA-IICT in FIRE 2015 shared task on mixed script information retrieval, in Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2015), pp. 53–56, Gandhinagar, India (2015)

  • T. Jauhiainen, Tekstin kielen automaattinen tunnistaminen. Master’s thesis, University of Helsinki, Helsinki (2010)

  • T. Jauhiainen, H. Jauhiainen, K. Lindén, Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models, in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 178–187, Ann Arbor, Michigan, June 2019c. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-1419. https://www.aclweb.org/anthology/W19-1419

  • T. Jauhiainen, H. Jauhiainen, K. Lindén, Iterative language model adaptation for Indo-Aryan language identification, in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 66–75, Santa Fe, New Mexico, USA, Aug. 2018a. Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-3907

  • T. Jauhiainen, H. Jauhiainen, K. Lindén, HeLI-based experiments in Swiss German dialect identification, in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 254–262, Santa Fe, New Mexico, USA, Aug. 2018b. Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-3929

  • T. Jauhiainen, H. Jauhiainen, K. Lindén, Naive Bayes-based experiments in Romanian dialect identification, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 76–83, Kyiv, Ukraine, Apr. 2021a. Association for Computational Linguistics. https://www.aclweb.org/anthology/2021.vardial-1.9

  • H. Jauhiainen, T. Jauhiainen, K. Lindén, Wanca in Korp: text corpora for under resourced Uralic languages, in Proceedings of the Research Data and Humanities (RDHUM) 2019 Conference, number 17 in Studia Humaniora Ouluensia, ed. by J. Jantunen, S. Brunni, N. Kunnas, S. Palviainen, K. Västi, pp. 21–40, Finland, 2019a. University of Oulu. ISBN 978-952-62-2320-9

  • T. Jauhiainen, H. Jauhiainen, N. Partanen, K. Lindén, Uralic language identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 173–185, Barcelona, Spain (Online), Dec. 2020c. International Committee on Computational Linguistics (ICCL). https://www.aclweb.org/anthology/2020.vardial-1.16

  • T. Jauhiainen, K. Lindén, H. Jauhiainen, Evaluation of language identification methods using 285 languages, in Proceedings of the 21st Nordic Conference on Computational Linguistics, pp. 183–191, Gothenburg, Sweden, May 2017a. Association for Computational Linguistics. https://www.aclweb.org/anthology/W17-0221

  • T. Jauhiainen, K. Lindén, H. Jauhiainen, Language Set Identification in Noisy Synthetic Multilingual Documents, in Proceedings of the Computational Linguistics and Intelligent Text Processing 16th International Conference (CICLing 2015), pp. 633–643, Cairo, Egypt, 2015c

  • T. Jauhiainen, K. Lindén, H. Jauhiainen, Language model adaptation for language and dialect identification of text. Nat. Lang. Eng. 25(5), 561–583 (2019)

  • H. Jhamtani, S.K. Bhogi, V. Raychoudhury, Word-level language identification in bi-lingual code-switched texts, in 28th Pacific Asia Conference on Language, Information and Computation, pp. 348–357, Phuket, Thailand (2014)

  • L. Kevers, CoSwID, a code switching identification method suitable for under-resourced languages, in Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, pp. 112–121, Marseille, France, June 2022. European Language Resources Association. https://aclanthology.org/2022.sigul-1.15

  • B. King, S. Abney, Labeling the languages of words in mixed-language documents using weakly supervised methods, in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1110–1119, Atlanta, Georgia, June 2013. Association for Computational Linguistics. https://aclanthology.org/N13-1131

  • J. King, J. Dehdari, An N-gram Based Language Identification System. The Ohio State University (2008)

  • L. King, S. Kübler, W. Hooper, Word-level language identification in The Chymistry of Isaac Newton. Digit. Sch. Hum. 30(4), 532–540 (2015)

  • T. Kocmi, O. Bojar, LanideNN: multilingual language identification on character window, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 927–936, Valencia, Spain, Apr. 2017. Association for Computational Linguistics. https://aclanthology.org/E17-1087

  • S.B. Kotsiantis, Supervised machine learning: a review of classification techniques. Informatica 31, 249–268 (2007)

  • S.S.V. Kusampudi, A. Chaluvadi, R. Mamidi, Corpus creation and language identification in low-resource code-mixed Telugu-English text, in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp. 744–752, Held Online, Sept. 2021. INCOMA Ltd. https://aclanthology.org/2021.ranlp-main.85

  • J.D. Lafferty, A. McCallum, F.C.N. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, in Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, San Francisco, CA, USA (Morgan Kaufmann Publishers Inc., 2001), pp. 282–289. ISBN 1-55860-778-1. http://dl.acm.org/citation.cfm?id=645530.655813

  • B.S. Lakshmi, B. Shambhavi, An Automatic Language Identification system for code-mixed English-Kannada Social Media Text, in 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS), pp. 1–5 (IEEE, 2017)

  • P. Lamabam, K. Chakma, A language identification system for code-mixed English-Manipuri social media text, in Proceedings of the IEEE International Conference on Engineering and Technology (ICETECH 2016), pp. 79–83, Coimbatore, TN, India (2016)

  • Y. Li, T. Baldwin, T. Cohn, What’s in a Domain? Learning domain-robust text representations using adversarial training, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 474–479, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2076. https://aclanthology.org/N18-2076

  • C.-C. Lin, W. Ammar, L. Levin, C. Dyer, The CMU submission for the shared task on language identification in code-switched data, in Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 80–86, Doha, Qatar, Oct. 2014. Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-3909. https://aclanthology.org/W14-3909

  • W. Ling, G. Xiang, C. Dyer, A. Black, I. Trancoso, Microblogs as parallel corpora, in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 176–186, Sofia, Bulgaria, Aug. 2013. Association for Computational Linguistics. https://aclanthology.org/P13-1018

  • N. Ljubešić, F. Klubička, bs, hr, srWaC—web corpora of Bosnian, Croatian and Serbian, in Proceedings of the 9th Web as Corpus Workshop (WaC-9), pp. 29–35, Gothenburg, Sweden, Apr. 2014. Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-0405. https://aclanthology.org/W14-0405

  • Y. Ludovik, R. Zacharski, Multilingual Document Language Recognition for Creating Corpora. Technical report, New Mexico State University (1999)

  • M. Lui, Generalized Language Identification. Ph.D. thesis, The University of Melbourne (2014)

  • M. Lui, T. Baldwin, Cross-domain feature selection for language identification, in Proceedings of 5th International Joint Conference on Natural Language Processing, pp. 553–561, Chiang Mai, Thailand, Nov. 2011. Asian Federation of Natural Language Processing. https://aclanthology.org/I11-1062

  • M. Lui, J.H. Lau, T. Baldwin, Automatic detection and language identification of multilingual documents. Trans. Assoc. Comput. Linguist. 2, 27–40 (2014). https://doi.org/10.1162/tacl_a_00163. https://aclanthology.org/Q14-1003

  • M. Lundén, O.L. Schalberg, Dissertatio critico-theologica, de vera indole partium poenitentiae, quam, cons. max. ven. fac. theol. ad Reg. Acad. Aboëns. auctor & praeses mag. Olavus Schalberg, metaphys. & log. prof. reg. & ord. nec non respondens Michaël Lundén, sac. minist. adj. & curam gerens, ad St. Cathar. Publicae bonorum disquisitioni submittunt, die [ ] Novembr. an. MDCCLXXXV, in auditorio majori, h. a. m. c. PhD thesis, Aboae, 1785. http://urn.fi/URN:NBN:fi-fd2014-00005712

  • M. Mager, Ö. Çetinoğlu, K. Kann, Subword-Level language identification for intra-word code-switching, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 2005–2011, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1201. https://www.aclweb.org/anthology/N19-1201

  • M. Majliš, Large Multilingual Corpus. Master’s thesis, Charles University in Prague, Prague (2011)

  • M. Majliš, Yet another language identifier, in Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 46–54, Avignon, France, Apr. 2012. Association for Computational Linguistics. https://aclanthology.org/E12-3006

  • S. Malmasi, Open-Set Language Identification (2017). arXiv:1707.04817

  • S. Malmasi, M. Dras, Language identification using classifier ensembles, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 35–43, Hissar, Bulgaria, Sept. 2015b. Association for Computational Linguistics. https://aclanthology.org/W15-5407

  • S. Mandal, A.K. Singh, Language identification in code-mixed data using multichannel neural networks and context capture, in Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pp. 116–120, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-6116. https://aclanthology.org/W18-6116

  • S. Mandal, S. Banerjee, S.K. Naskar, P. Rosso, S. Bandyopadhyay, Adaptive voting in multiple classifier systems for word level language identification, in Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2015), pp. 49–52, Gandhinagar, India (2015)

  • T. Mandl, M. Shramko, O. Tartakovski, C. Womser-Hacker, Language identification in multi-lingual web-documents, in Proceedings of the 11th International Conference on Applications of Natural Language to Information Systems (NLDB 2006), pp. 153–163, Klagenfurt, Austria (2006)

  • L.A. Mather, A linear algebra approach to language identification, in Proceedings of the 4th International Workshop Principles of Digital Document Processing (PODDP’98), pp. 92–103, Saint Malo, France (1998)

  • D. Mave, S. Maharjan, T. Solorio, Language identification and analysis of code-switched social media text, in Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pp. 51–61. Association for Computational Linguistics (2018). http://aclweb.org/anthology/W18-3206

  • U.F. Mayer, Bootstrapped language identification for multi-site internet domains, in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 579–585, Beijing, China (2012)

  • A. Minocha, F.M. Tyers, Subsegmental language detection in Celtic language text, in Proceedings of the First Celtic Language Technology Workshop (CLTW 2014), pp. 76–80, Dublin, Ireland (2014)

  • A. Mishra, Y. Sharma, Language identification and context-based analysis of code-switching behaviors in social media discussions, in 2019 IEEE International Conference on Big Data (Big Data), pp. 5951–5956 (2019). https://doi.org/10.1109/BigData47090.2019.9006032

  • G. Mohr, M. Stack, I. Ranitovic, D. Avery, M. Kimpton, Introduction to Heritrix, in 4th International Web Archiving Workshop, Bath, UK (2004)

  • K.N. Murthy, G.B. Kumar, Language identification from small text samples. J. Quant. Linguist. 13(1), 57–80 (2006)

  • I. Ndubuisi-Obi, S. Ghosh, D. Jurgens, Wetin dey with these comments? Modeling sociolinguistic factors affecting code-switching behavior in Nigerian online discussions, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6204–6214, Florence, Italy, July 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1625. https://aclanthology.org/P19-1625

  • K. Nelakuditi, D.S. Jitta, R. Mamidi, Part-of-Speech tagging for code mixed English-Telugu Social Media Data, in International Conference on Intelligent Text Processing and Computational Linguistics, pp. 332–342, Springer (2016)

  • H. Ney, U. Essen, R. Kneser, On structuring probabilistic dependences in stochastic language modelling. Comput. Speech Lang. 8(1), 1–38 (1994)

  • L. Nguyen, C. Bryant, S. Kidwai, T. Biberauer, Automatic language identification in code-switched Hindi-English social media text. J. Open Hum. Data 7 (2021)

  • D. Nguyen, A.S. Doğruöz, Word level language identification in online multilingual communication, in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 857–862, Seattle, Washington, USA, Oct. 2013. Association for Computational Linguistics. https://aclanthology.org/D13-1084

  • G. Ozbek, I. Rosenn, E. Yeh, Language Classification in Multilingual Documents. Technical report, Stanford University (2006)

  • D. Pelleg, A. Moore, X-means: extending k-means with efficient estimation of the number of clusters, in Proceedings of the 17th International Conference on Machine Learning, vol. 1, pp. 727–734 (2000)

  • F. Peng, F. Feng, A. McCallum, Chinese segmentation and new word detection using conditional random fields, in COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pp. 562–568, Geneva, Switzerland, Aug. 23–Aug. 27 2004. COLING. https://aclanthology.org/C04-1081

  • G. Pethö, E. Mózes, An N-gram-based language identification algorithm for variable-length and variable-language texts. Argumentum 10, 56–82 (2014)

  • A. Phadte, G. Thakkar, Towards normalising Konkani-English code-mixed social media text, in Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017), pp. 85–94 (2017)

  • A. Phadte, R. Wagh, Word level language identification system for Konkani-English code-mixed social media text (CMST), in Proceedings of the 10th Annual ACM India Compute Conference, pp. 103–107 (2017)

  • F. Pla, L.-F. Hurtado, Language identification in twitter: a study case of multiclass and multilabel text classification problem. Int. J. Comput. Linguist. Appl. 6(1), 135–150 (2015)

  • F. Pla, L.-F. Hurtado, Language identification of multilingual posts from twitter: a case study. Knowl. Inf. Syst. 51(3), 965–989 (2017)

  • A. Poulston, Z. Waseem, M. Stevenson, Using TF-IDF n-gram and word embedding cluster ensembles for author profiling: notebook for PAN at CLEF 2017, in Working Notes Papers of CLEF 2017 Evaluation Labs and Workshop, Dublin, Ireland, September 2017, ed. by L. Cappellato, N. Ferro, L. Goeuriot, T. Mandl. CEUR-WS.org. http://ceur-ws.org/Vol-1866/

  • J.M. Prager, Linguini: language identification for multilingual documents, in Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences (HICSS-32), Maui, USA (1999)

  • V. Ramanarayanan, R. Pugh, Automatic token and turn level language identification for code-switched text dialog: an analysis across language pairs and corpora, in Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pp. 80–88, Melbourne, Australia, July 2018. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5009. https://aclanthology.org/W18-5009

  • P. Rodrigues, Processing Highly Variant Language Using Incremental Model Selection. Ph.D. thesis, Indiana University (2012)

  • Y. Samih, Dialectal Arabic Processing Using Deep Learning. Ph.D. thesis, Heinrich-Heine-Universität Düsseldorf, Düsseldorf, Germany (2017)

  • Y. Samih, S. Maharjan, M. Attia, L. Kallmeyer, T. Solorio, Multilingual code-switching identification via LSTM recurrent neural networks, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 50–59, Austin, Texas, Nov. 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5806. https://aclanthology.org/W16-5806

  • Y. Samih, W. Maier, Detecting code-switching in Moroccan Arabic social media, in Proceedings of the 4th International Workshop on Natural Language Processing for Social Media (SocialNLP 2016 IJCAI), New York City, USA (2016)

  • S. Schulz, M. Keller, Code-switching ubique est—language identification and part-of-speech tagging for historical mixed text, in Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 43–51, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-2105. https://aclanthology.org/W16-2105

  • P. Shrestha, Codeswitching detection via lexical features in conditional random fields, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 121–126, Austin, Texas, Nov. 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5816. https://aclanthology.org/W16-5816

  • U.K. Sikdar, B. Gambäck, Language identification in code-switched text using conditional random fields and BabelNet, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 127–131, Austin, Texas, Nov. 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5817. https://aclanthology.org/W16-5817

  • T. Solorio, E. Blair, S. Maharjan, S. Bethard, M. Diab, M. Ghoneim, A. Hawwari, F. AlGhamdi, J. Hirschberg, A. Chang, P. Fung, Overview for the first shared task on language identification in code-switched data, in Proceedings of The First Workshop on Computational Approaches to Code Switching, pp. 62–72, Doha, Qatar, Oct. 2014. http://www.aclweb.org/anthology/W14-3907

  • N.B. Sristy, N.S. Krishna, B.S. Krishna, V. Ravi, Language identification in mixed script, in Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation, pp. 14–20 (2017)

  • M. Stupar, T. Jurić, N. Ljubešić, Language identification of web data for building linguistic corpora, in Proceedings of the 3rd International Conference on The Future of Information Sciences (INFuture 2011), pp. 365–372, Zagreb, Croatia (2011)

  • I. Suzuki, Y. Mikami, A. Ohsato, Y. Chubachi, A language and character set determination method based on \(n\)-gram statistics. ACM Trans. Asian Lang. Inf. Proc. (TALIP) 1(3), 269–278 (2002)

  • Y. Teh, M. Jordan, M. Beal, D. Blei, Sharing clusters among related groups: hierarchical Dirichlet processes. Advances in Neural Information Processing Systems, vol. 17 (2004)

  • G.B. Tran, D.B. Nguyen, B.T. Kieu, \(n\)-gram based approach for multilingual language identification. Poster. Technical report, Australasian Language Technology Association Workshop (2010). http://comp.mq.edu.au/programming/task_description/VILangTek.pdf

  • E. Ullman, Shibboleth—A Multilingual Language Identifier. Master’s thesis, Uppsala University, Uppsala (2014)

  • T. Vatanen, J.J. Väyrynen, S. Virpioja, Language identification of short text segments with N-gram models, in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, May 2010. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2010/pdf/279_Paper.pdf

  • J. Vogel, D. Tresner-Kirsch, Robust language identification in short, noisy texts: improvements to LIGA, in Proceedings of the 3rd International Workshop on Mining Ubiquitous and Social Environments (MUSE), ed. by M. Atzmueller, H. Andreas, pp. 43–50, Bristol, UK (2012)

  • M. Volk, L. Fischer, P. Scheurer, B.S. Schroffenegger, R. Schwitter, P. Ströbel, B. Suter, Nunc profana tractemus. Detecting code-switching in a large corpus of 16th century letters, in Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 2901–2908, Marseille, France, June 2022. European Language Resources Association. https://aclanthology.org/2022.lrec-1.311

  • C. Voss, S. Tratz, J. Laoudi, D. Briesch, Finding romanized Arabic dialect in code-mixed tweets, in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 2249–2253, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2014/pdf/1116_Paper.pdf

  • A. Wan, Leveraging data-driven methods in word-level language identification for a Multilingual Alpine Heritage Corpus, in Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP, pp. 45–54, San Diego, CA, June 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-1206. https://aclanthology.org/W16-1206

  • N. Wu, E. DeMattos, K.H. So, P.-Z. Chen, Ç. Çöltekin, Language discrimination and transfer learning for similar languages: Experiments with feature combinations and adaptation, in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 54–63, Ann Arbor, Michigan, June 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-1406. https://www.aclweb.org/anthology/W19-1406

  • M.X. Xia, Codeswitching language identification using subword information enriched word vectors, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 132–136, Austin, Texas, Nov. 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5818. https://aclanthology.org/W16-5818

  • M.X. Xia, J.C.K. Cheung, Accurate Pinyin-English codeswitched language identification, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 71–79, Austin, Texas, Nov. 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-5809. https://aclanthology.org/W16-5809

  • F. Xia, W. Lewis, H. Poon, Language ID in the context of harvesting language data off the web, in Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pp. 870–878, Athens, Greece, Mar. 2009. Association for Computational Linguistics. https://aclanthology.org/E09-1099

  • H. Yamaguchi, K. Tanaka-Ishii, Text segmentation by language using minimum description length, in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 969–978, Jeju Island, Korea, July 2012. Association for Computational Linguistics. https://aclanthology.org/P12-1102

  • Z. Yirmibeşoğlu, G. Eryiğit, Detecting code-switching between Turkish-English language pair, in Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pp. 110–115, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-6115. https://aclanthology.org/W18-6115

  • J. Younes, H. Achour, E. Souissi, A. Ferchichi, A deep learning approach for the Romanized Tunisian Dialect identification. Int. Arab J. Inf. Technol. (IAJIT) 17(6), 935–946 (2020)

  • M. Zampieri, B.G. Gebre, H. Costa, J. van Genabith, Comparing approaches to the identification of similar languages, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 66–72, Hissar, Bulgaria, Sept. 2015a. Association for Computational Linguistics. https://aclanthology.org/W15-5411

  • M. Zampieri, S. Malmasi, P. Nakov, A. Ali, S. Shon, J. Glass, Y. Scherrer, T. Samardžić, N. Ljubešić, J. Tiedemann, C. van der Lee, S. Grondelaers, N. Oostdijk, D. Speelman, A. van den Bosch, R. Kumar, B. Lahiri, M. Jain, Language identification and morphosyntactic tagging: the second VarDial evaluation campaign, in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 1–17, Santa Fe, New Mexico, USA, Aug. 2018a. Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-3901

  • M. Zampieri, L. Tan, N. Ljubešić, J. Tiedemann, P. Nakov, Overview of the DSL shared task 2015, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 1–9, Hissar, Bulgaria, Sept. 2015b. Association for Computational Linguistics. https://www.aclweb.org/anthology/W15-5401

  • W. Zhang, R.A.J. Clark, Y. Wang, W. Li, Unsupervised language identification based on latent Dirichlet allocation. Comput. Speech Lang. 39, 47–66 (2016)

Author information

Corresponding author

Correspondence to Tommi Jauhiainen.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Jauhiainen, T., Zampieri, M., Baldwin, T., Lindén, K. (2024). Large Scale, Multi-domain Language Identification. In: Automatic Language Identification in Texts. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-45822-4_5

  • DOI: https://doi.org/10.1007/978-3-031-45822-4_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45821-7

  • Online ISBN: 978-3-031-45822-4

  • eBook Packages: Synthesis Collection of Technology (R0)
