Features and Methods

Part of the book series: Synthesis Lectures on Human Language Technologies (SLHLT)

Abstract

In addition to the features and methods used in language identification (LI), this chapter introduces the notation devised by Jauhiainen et al. (2019e), which is used throughout this book to describe LI methods. For easier reference, the complete description of the notation is given in the first section of the chapter. The notation may be difficult to digest without concrete examples, so it is also introduced gradually in the descriptions of features and methods in Sects. 2.2 and 2.3. We have translated the notation of the original papers into our notation to make it easier to see the similarities and differences between the LI methods presented in the literature.

Notes

  1. http://wordlist.aspell.net/dicts/.
  2. http://urn.fi/urn:nbn:fi:lb-2020102201.
  3. http://urn.fi/urn:nbn:fi:lb-2021062801.
  4. The semi-random order of the characters is as follows: qazæwsxøedcårfvötgbäyhnujmikolp.
  5. https://scikit-learn.org/.
  6. https://scikit-learn.org/stable/modules/feature_extraction.html.
  7. https://zenodo.org/record/163812#.YFmsDS0RrUI.
  8. The presentation is far too long to be re-presented here.
  9. Yet Another Language Identifier.
  10. To the best of our knowledge, the multivariate Bernoulli version of NB has never been used for LI. See Giwa (2016) for a possible explanation.
  11. Modern Standard Arabic (MSA) and Egyptian Arabic.
  12. The update dates in the footnotes were checked in March 2023 to indicate how active the different projects were at the time of writing.
  13. http://scikit-learn.org/stable/ (last updated March 2023).
  14. https://www.cs.waikato.ac.nz/ml/weka/ (last updated March 2023).
  15. https://wiki.pentaho.com/display/DATAMINING/Classifiers.
  16. http://www.nltk.org (last updated January 2023).
  17. http://www.nltk.org/py-modindex.html.
  18. http://mallet.cs.umass.edu (last updated December 2022).
  19. http://www.speech.sri.com/projects/srilm/download.html (last updated September 2022).
  20. https://www.csie.ntu.edu.tw/~cjlin/liblinear/ (last updated February 2023).
  21. http://liblinear.bwaldvogel.de/ (last updated May 2022).
  22. https://www.csie.ntu.edu.tw/~cjlin/libsvm/ (last updated February 2023).
  23. https://www.tensorflow.org (last updated March 2023).
  24. https://keras.io.
  25. https://pytorch.org (last updated March 2023).

References

  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/. Software available from tensorflow.org

  • K. Abainia, S. Ouamour, H. Sayoud, Effective language identification of forum texts based on statistical approaches. Inf. Process. Manag. 52, 491–512 (2016)

  • J. Ács, L. Grad-Gyenge, T.B. Rodrigues de Rezende Oliveira, A two-level classifier for discriminating similar languages, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria (Association for Computational Linguistics, 2015), pp. 73–77. https://aclanthology.org/W15-5412

  • I. Adebara, A. Elmadany, M. Abdul-Mageed, A. Inciarte, AfroLID: a neural language identification tool for African languages, in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates (Association for Computational Linguistics, 2022), pp. 1958–1981. https://aclanthology.org/2022.emnlp-main.128

  • W. Adouane, N. Semmar, R. Johansson, V. Bobicev, Automatic detection of Arabicized Berber and Arabic varieties, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 63–72. https://aclanthology.org/W16-4809

  • N. Aepli, A. Anastasopoulos, A.-G. Chifu, W. Domingues, F. Faisal, M. Gaman, R.T. Ionescu, Y. Scherrer, Findings of the VarDial evaluation campaign 2022, in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022), pp. 1–13. https://aclanthology.org/2022.vardial-1.1

  • G.I. Ahmad, J. Singla, (LISACMT) Language identification and sentiment analysis of English-Urdu ‘code-mixed’ text using LSTM, in 2022 International Conference on Inventive Computation Technologies (ICICT) (2022), pp. 430–435. https://doi.org/10.1109/ICICT54344.2022.9850505

  • B. Ahmed, S.-H. Cha, C. Tappert, Language identification from text using N-gram based cumulative frequency addition, in Proceedings of Student/Faculty Research Day (CSIS, Pace University, New York, USA, 2004), pp. 12.1–12.8

  • B. AlKhamissi, M. Gabr, M. ElNokrashy, K. Essam, Adapting MARBERT for improved Arabic dialect identification: Submission to the NADI 2021 shared task, in Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine (Virtual) (Association for Computational Linguistics, 2021), pp. 260–264. https://aclanthology.org/2021.wanlp-1.29

  • T. Alqurashi, Applying a character-level model to a short Arabic dialect sentence: a Saudi dialect as a case study. Appl. Sci. 12(23) (2022). ISSN 2076-3417. https://doi.org/10.3390/app122312435. https://www.mdpi.com/2076-3417/12/23/12435

  • M.Z. Ansari, T. Ahmad, M.M.S. Beg, A. Ikram, A Simple and Efficient Probabilistic Language model for Code-Mixed Text (2021a). arXiv:2106.15102

  • M.Z. Ansari, M.M.S. Beg, T. Ahmad, M.J. Khan, G. Wasim, Language Identification of Hindi-English tweets using code-mixed BERT (2021b). arXiv:2107.01202

  • A. Avenberg, Automatic language identification of short texts. Master’s thesis, Uppsala University (2020)

  • A. Babhulgaonkar, S. Sonavane, Language identification for multilingual machine translation, in 2020 International Conference on Communication and Signal Processing (ICCSP) (2020), pp. 401–405. https://doi.org/10.1109/ICCSP48568.2020.9182184

  • A.S. Babu, P. Kumar, Comparing neural network approach with N-gram approach for text categorization. Int. J. Comput. Sci. Eng. 2(1), 80–83 (2010)

  • I. Balažević, M. Braun, K.-R. Müller, Language Detection For Short Text Messages In Social Media (2016). arXiv:1608.08515

  • T. Baldwin, M. Lui, Language identification: the long and the short of the matter, in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, USA (Association for Computational Linguistics, 2010b), pp. 229–237. https://aclanthology.org/N10-1027

  • E.O. Batchelder, A Learning Experience: Training an Artificial Neural Network to Discriminate Languages. Technical report (1992)

  • G. Bernier-Colborne, C. Goutte, Challenges in neural language identification: NRC at VarDial 2020, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain (International Committee on Computational Linguistics (ICCL), 2020), pp. 273–282. https://www.aclweb.org/anthology/2020.vardial-1.26

  • G. Bernier-Colborne, C. Goutte, S. Léger, Improving cuneiform language identification with BERT, in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, Ann Arbor, Michigan (Association for Computational Linguistics, 2019), pp. 17–25. https://doi.org/10.18653/v1/W19-1402. https://www.aclweb.org/anthology/W19-1402

  • G. Bernier-Colborne, S. Leger, C. Goutte, N-gram and neural models for Uralic language identification: NRC at VarDial 2021, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021), pp. 128–134. https://www.aclweb.org/anthology/2021.vardial-1.15

  • G. Bernier-Colborne, S. Leger, C. Goutte, Transfer learning improves French cross-domain dialect identification: NRC @ VarDial 2022, in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022), pp. 109–118. https://aclanthology.org/2022.vardial-1.12

  • Y. Bestgen, Improving the character ngram model for the DSL task with BM25 weighting and less frequently used feature sets, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain (Association for Computational Linguistics, 2017), pp. 115–123. https://doi.org/10.18653/v1/W17-1214. https://aclanthology.org/W17-1214

  • Y. Bestgen, Optimizing a supervised classifier for a difficult language identification problem, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021), pp. 96–101. https://www.aclweb.org/anthology/2021.vardial-1.11

  • S.N. Bhattu, V. Ravi, Language identification in mixed script social media text, in Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India (2015), pp. 39–31

  • S. Bird, E. Klein, E. Loper, Natural Language Processing With Python: Analyzing Text With the Natural Language Toolkit (O’Reilly Media, Inc., 2009)

  • J. Bjerva, Byte-based Language Identification with Deep Convolutional Networks, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 119–125. https://aclanthology.org/W16-4816

  • A. Bosca, L. Dini, Language identification strategies for cross language information retrieval, in Working Notes for LogCLEF2010: The CLEF Multilingual Logfile Analysis Track (Padua, Italy, 2010), p.2010

  • B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in Proceedings of the Fifth Annual Workshop on Computational Learning Theory COLT’92, Pittsburgh, USA (1992), pp. 144–152

  • G.R. Botha, Text-Based Language Identification for The South African Languages. Master’s thesis, University of Pretoria, Hatfield, Pretoria, South Africa (2008)

  • G.R. Botha, E. Barnard, Factors that affect the accuracy of text-based language identification, in Proceedings of the Eighteenth Annual Symposium of the Pattern Recognition Association of South Africa, ed. by J.R. Tapamo, F. Nicolls, Pietermaritzburg, South Africa (2007), pp. 7–12

  • G.R. Botha, E. Barnard, Factors that affect the accuracy of text-based language identification. Comput. Speech Lang. 26(5), 307–320 (2012)

  • G. Botha, V. Zimu, E. Barnard, Text-based language identification for South African languages. Trans. South African Inst. Electr. Eng. 98(4), 141–148 (2007)

  • L. Breiman, Random forests. Mach. Learn. 45(1), 5–32 (2001)

  • R. Brown, Non-linear mapping for improved identification of 1300+ languages, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar (Association for Computational Linguistics, 2014), pp. 627–632. https://doi.org/10.3115/v1/D14-1069, https://aclanthology.org/D14-1069

  • R.D. Brown, Selecting and weighting n-grams to identify 1100 languages, in Proceedings of the 16th International Conference on Text, Speech and Dialogue (TSD 2013), Plzeň, Czech Republic (2013), pp. 475–483

  • R.D. Brown, Finding and identifying text in 900+ languages. Digit. Invest. 9, S34–S43 (2012)

  • Ç. Çöltekin, Dialect identification under domain shift: Experiments with discriminating Romanian and Moldavian, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain (Online) (International Committee on Computational Linguistics (ICCL), 2020), pp. 186–192. https://www.aclweb.org/anthology/2020.vardial-1.17

  • Ç. Çöltekin, T. Rama, Discriminating similar languages with linear SVMs and neural networks, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 15–24. https://aclanthology.org/W16-4802

  • Ç. Çöltekin, T. Rama, V. Blaschke, Tübingen-Oslo team at the VarDial 2018 evaluation campaign: an analysis of n-gram features in language variety identification, in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, New Mexico, USA (Association for Computational Linguistics, 2018), pp. 55–65. https://aclanthology.org/W18-3906

  • G. Camposampiero, Q.A. Nguyen, F. Di Stefano, The curious case of logistic regression for Italian languages and dialects identification, in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022), pp. 86–98. https://aclanthology.org/2022.vardial-1.10

  • S. Carter, W. Weerkamp, M. Tsagkias, Microblog language identification: overcoming the limitations of short, unedited and idiomatic text. Lang. Res. Eval. 47(1), 195–215 (2013)

  • D. Castro, E. Souza, A.L.I. de Oliveira, Discriminating between Brazilian and European Portuguese National varieties on Twitter texts, in Proceedings of the 5th Brazilian Conference on Intelligent Systems (BRACIS 2016), Recife, Pernambuco, Brazil (IEEE, 2016), pp. 265–270

  • D.W. Castro, E. Souza, D. Vitório, D. Santos, A.L.I. Oliveira, Smoothed N-gram based models for tweet language identification: a case study of the Brazilian and European Portuguese National varieties. Appl. Soft Comput. 61, 1160–1172 (2017)

  • W.B. Cavnar, J.M. Trenkle, N-Gram-based text categorization, in Proceedings of SDAIR-94, Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, USA (1994), pp. 161–175

  • J. Cazamias, C. Dixit, M. Marek, Large-Scale Language Classification - Writing a Detector for 200 Languages on Twitter. Stanford course report (2015)

  • A. Ceolin, Comparing the performance of CNNs and shallow models for language identification, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021), pp. 102–112. https://www.aclweb.org/anthology/2021.vardial-1.12

  • A. Ceolin, Neural networks for cross-domain language identification. phlyers @Vardial 2022, in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022), pp. 99–108. https://aclanthology.org/2022.vardial-1.11

  • A. Ceolin, H. Zhang, Discriminating between standard Romanian and Moldavian tweets using filtered character ngrams, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain (International Committee on Computational Linguistics (ICCL), 2020), pp. 265–272. https://www.aclweb.org/anthology/2020.vardial-1.25

  • B.R. Chakravarthi, M. Gaman, R.T. Ionescu, H. Jauhiainen, T. Jauhiainen, K. Lindén, N. Ljubešić, N. Partanen, R. Priyadharshini, C. Purschke, E. Rajagopal, Y. Scherrer, M. Zampieri, Findings of the VarDial evaluation campaign 2021, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021), pp. 1–11. https://www.aclweb.org/anthology/2021.vardial-1.1

  • J.C. Chang, C.-C. Lin, Recurrent-neural-network for Language Detection on Twitter Code-Switching Corpus (2014). arXiv:1412.4314

  • C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines. ACM Trans. Intel. Syst. Technol. (TIST) 2(3), 27 (2011)

  • S.F. Chen, B. Maison, Using place name data to train language identification models, in 8th European Conference on Speech Communication and Technology EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland (2003), pp. 1349–1352

  • S.F. Chen, J. Goodman, An empirical study of smoothing techniques for language modeling. Comput Speech Lang 13(4), 359–394 (1999)

  • K. Church, Stress assignment in letter to sound rules for speech synthesis, in 23rd Annual Meeting of the Association for Computational Linguistics, Chicago, Illinois, USA (Association for Computational Linguistics, 1985), pp. 246–253. https://doi.org/10.3115/981210.981240, https://aclanthology.org/P85-1030

  • K. Darwish, H. Sajjad, H. Mubarak, Verifiably effective Arabic dialect identification, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar (Association for Computational Linguistics, 2014), pp. 1465–1468. https://doi.org/10.3115/v1/D14-1154, https://aclanthology.org/D14-1154

  • J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota (Association for Computational Linguistics, 2019), pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423

  • P. Diaconis, R.L. Graham, Spearman’s footrule as a measure of disarray. J. Roy. Stat. Soc. Ser. B (Methodol.) 39(2), 262–268 (1977)

  • N. Dongen, Analysis and Prediction of Dutch-English Code-switching in Dutch Social Media Messages. Master’s thesis, Universiteit van Amsterdam, Amsterdam, Netherlands (2017)

  • Y. Doval, D. Vilares, J. Vilares, Automatic language identification in Twitter: adapting state-of-the-art identifiers to the Iberian context, in Proceedings of the Tweet Language Identification Workshop 2014 co-located with 30th Conference of the Spanish Society for Natural Language Processing (SEPLN 2014), Girona, Spain (2014), pp. 39–43

  • S. Dowlagar, R. Mamidi, A pre-trained transformer and CNN model with joint language ID and part-of-speech tagging for code-mixed social-media text, in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), Held Online (INCOMA Ltd, 2021), pp. 367–374. https://aclanthology.org/2021.ranlp-main.42

  • J. Dunn, W. Nijhof, Language identification for Austronesian languages, in Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France (European Language Resources Association, 2022), pp. 6530–6539. https://aclanthology.org/2022.lrec-1.701

  • T. Dunning, Statistical Identification of Language. Technical Report MCCS 940-273, Computing Research Laboratory, New Mexico State University (1994)

  • A. Dutta, Word-level language identification using subword embeddings for code-mixed Bangla-English social media data, in Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages within the 13th Language Resources and Evaluation Conference, Marseille, France (European Language Resources Association, 2022), pp. 76–82. https://aclanthology.org/2022.dclrl-1.10

  • A. El Mekki, A. El Mahdaouy, K. Essefar, N. El Mamoun, I. Berrada, A. Khoumsi, BERT-based multi-task model for country and province level MSA and dialectal Arabic identification, in Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine (Virtual) (Association for Computational Linguistics, 2021), pp. 271–275. https://www.aclweb.org/anthology/2021.wanlp-1.31

  • R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9(Aug), 1871–1874 (2008)

  • H.-H. Franco-Penya, L. Mamani Sanchez, Tuning Bayes baseline for dialect detection, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 227–234. https://aclanthology.org/W16-4829

  • M. Franco-Salvador, N. Plotnikova, N. Pawar, Y. Benajiba, Subword-based deep averaging networks for author profiling – notebook for PAN at CLEF 2017, in Working Notes Papers of CLEF 2017 Evaluation Labs and Workshop, ed. by L. Cappellato, N. Ferro, L. Goeuriot, T. Mandl, Dublin, Ireland (2017). CEUR-WS.org. http://ceur-ws.org/Vol-1866/

  • M. Franco-Salvador, P. Rosso, F. Rangel, Distributed representations of words and documents for discriminating similar languages, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria (Association for Computational Linguistics, 2015), pp. 11–16. https://aclanthology.org/W15-5403

  • F. Gaim, W. Yang, J.C. Park, GeezSwitch: language identification in typologically related low-resourced East African languages, in Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France (European Language Resources Association, 2022), pp. 6578–6584. https://aclanthology.org/2022.lrec-1.707

  • P. Gamallo, J.R. Pichel, I. Alegria, A perplexity-based method for similar languages discrimination, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain (Association for Computational Linguistics, 2017), pp. 109–114. https://doi.org/10.18653/v1/W17-1213. https://aclanthology.org/W17-1213

  • S. Gella, K. Bali, M. Choudhury, “ye word kis lang ka hai bhai?” Testing the limits of word level language identification, in Proceedings of ICON-2014, the 11th International Conference on Natural Language Processing, Goa, India (2014)

  • M. Gemeda Yigezu, A. Lambebo Tonja, O. Kolesnikova, M. Shahiki Tash, G. Sidorov, A. Gelbukh, Word level language identification in code-mixed Kannada-English texts using deep learning approach, in Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, IIIT Delhi, New Delhi, India (Association for Computational Linguistics, 2022), pp. 29–33. https://aclanthology.org/2022.icon-wlli.6

  • E. Giguet, Categorization according to language: a step toward combining linguistic knowledge and statistic learning, in Proceedings of the International Workshop on Parsing Technologies (IWPT’95), Prague - Karlovy Vary, Czech Republic (1995)

  • N. Gillin, Is encoder-decoder transformer the shiny hammer? in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022), pp. 80–85. https://aclanthology.org/2022.vardial-1.9

  • O. Giwa, Language Identification for Proper Name Pronunciation. Ph.D. thesis, North-West University, Vaal Triangle (2016)

  • O. Giwa, M.H. Davel, Language identification of individual words with joint sequence models, in Proceedings of Interspeech 2014, Singapore (2014)

  • C. Goutte, S. Léger, Experiments in discriminating similar languages, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria (Association for Computational Linguistics, 2015), pp. 78–84. https://aclanthology.org/W15-5413

  • C. Goutte, S. Léger, M. Carpuat, The NRC system for discriminating similar languages, in Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, Dublin, Ireland (Association for Computational Linguistics and Dublin City University, 2014), pp. 139–145. https://doi.org/10.3115/v1/W14-5316. https://aclanthology.org/W14-5316

  • G. Grefenstette, Comparing two language identification schemes, in Proceedings of the 3rd International Conference on Statistical Analysis of Textual Data (JADT 1995), Rome, Italy (1995)

  • S. Gundapu, R. Mamidi, Word level language identification in English Telugu code mixed data, in Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation, Hong Kong (Association for Computational Linguistics, 2018). Accessed from 1–3 Dec. 2018. https://www.aclweb.org/anthology/Y18-1021

  • D.K. Gupta, S. Kumar, A. Ekbal, Machine learning approach for language identification & transliteration: shared task report of IITP-TS, in Forum for Information Retrieval Evaluation (FIRE) (Bangalore, India, 2014), pp. 60–64

  • R. Haas, L. Derczynski, Discriminating between similar Nordic languages, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021), pp. 67–75. https://aclanthology.org/2021.vardial-1.8

  • H. Haddad, A.C. Rouhou, A. Messaoudi, A. Korched, C. Fourati, A. Sellami, M. Ben HajHmida, F. Ghriss, TunBERT: pretraining BERT for Tunisian dialect understanding. SN Comput. Sci. 4(2), 194 (2023)

  • J. Häkkinen, J. Tian, N-gram and decision tree based language identification for written words, in Conference Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2001), Madonna di Campiglio, Italy (2001), pp. 335–338

  • M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update. ACM SIGKDD Explorations Newslett 11(1), 10–18 (2009)

  • A. Hamzah, Deteksi bahasa untuk dokumen teks berbahasa Indonesia, in Seminar Nasional Informatika (semnasIF 2010), Jakarta, Indonesia (2010), pp. A5–A13

  • A. Hanani, A. Qaroush, S. Taylor, Classifying ASR transcriptions according to Arabic dialect, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 126–134. https://aclanthology.org/W16-4817

  • H. Hassanpour, M. AlyanNezhadi, M. Mohammadi, A signal processing method for text language identification. Int. J. Eng. 34(6), 1413–1418 (2021)

  • J. He, Z. Zhang, X. Zhao, P. Li, Y. Yan, Similar language identification for Uyghur and Kazakh on short spoken texts, in Proceedings of the 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC 2016), vol. 2, Hangzhou, China (2016), pp. 496–499

  • R. Hecht-Nielsen, Theory of the backpropagation neural network, in Proceedings of the International Joint Conference on Neural Networks (IJCNN 1989), Washington, DC, USA (1989), pp. I593–I605. https://doi.org/10.1109/IJCNN.1989.118638

  • P. Henrich, Language identification for the automatic grapheme-to-phoneme conversion of foreign words in a German text-to-speech system, in First European Conference on Speech Communication and Technology, Paris, France (1989), pp. 2220–2223

  • A.F. Hidayatullah, A. Qazi, D.T.C. Lai, R.A. Apong, A systematic review on language identification of code-mixed text: techniques, data availability, challenges, and framework development. IEEE Access 10, 122812–122831 (2022). https://doi.org/10.1109/ACCESS.2022.3223703

  • S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9, 1735–1780 (1997)

  • D.W. Hosmer, S. Lemeshow, R.X. Sturdivant, Applied logistic regression. Wiley Series in Probability and Statistics, 3rd edn. (Wiley, Hoboken, N.J., USA, 2013)

  • A.S. House, E.P. Neuburg, Toward automatic identification of the language of an utterance. I. Preliminary methodological considerations. J. Acoust. Soc. Am. 62(3), 708–713 (1977)

  • A. Hussain, M.U. Arshad, An Attention Based Neural Network for Code Switching Detection: English & Roman Urdu (2021). arXiv:2103.02252

  • D.-M. Iliescu, R. Grand, S. Qirko, R. van der Goot, Much gracias: semi-supervised code-switch detection for Spanish-English: how far can we get? in Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching (Association for Computational Linguistics, 2021), pp. 65–71. https://www.aclweb.org/anthology/2021.calcs-1.9

  • A. Jaech, G. Mulcaire, S. Hathi, M. Ostendorf, N.A. Smith, Hierarchical character-word models for language identification, in Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media, Austin, TX, USA (Association for Computational Linguistics, 2016a), pp. 84–93. https://doi.org/10.18653/v1/W16-6212. https://aclanthology.org/W16-6212

  • A. Jaech, G. Mulcaire, M. Ostendorf, N.A. Smith, A neural model for language identification in code-switched tweets, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, Austin, Texas (Association for Computational Linguistics, 2016b), pp. 60–64. https://doi.org/10.18653/v1/W16-5807. https://aclanthology.org/W16-5807

  • R. Jalam, Apprentissage Automatique et Catégorisation de Textes Multilingues. Ph.D. thesis, Université Lumière Lyon 2 (2003)

  • R. Jalam, O. Teytaud, Kernel-based text categorization, in Proceedings of the International Joint Conference on Neural Networks (IJCNN’01), vol. 3 Washington, DC, USA (2001a), pp. 1891–1896

  • R. Jalam, O. Teytaud, Identification de la Langue et Catégorisation de Textes Basées sur les N-grammes, in Journées Francophones d’extraction et de gestion de connaissances (EGC’2001), ed. by H. Briand, F. Guillet (Nantes, France, 2001), pp. 227–238

  • T. Jauhiainen, Tekstin kielen automaattinen tunnistaminen. Master’s thesis, University of Helsinki, Helsinki (2010)

  • T. Jauhiainen, H. Jauhiainen, K. Lindén, Optimizing naive Bayes for Arabic dialect identification, in Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates (Hybrid) (Association for Computational Linguistics, 2022c), pp. 409–414. https://aclanthology.org/2022.wanlp-1.40

  • T. Jauhiainen, H. Jauhiainen, K. Lindén, Discriminating similar languages with token-based backoff, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria (Association for Computational Linguistics, 2015b), pp. 44–51. https://www.aclweb.org/anthology/W15-5408

  • T. Jauhiainen, H. Jauhiainen, K. Lindén, Experiments in language variety geolocation and dialect identification, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain (Online) (International Committee on Computational Linguistics (ICCL), 2020b), pp. 220–231. https://www.aclweb.org/anthology/2020.vardial-1.21

  • T. Jauhiainen, H. Jauhiainen, K. Lindén, HeLI-based experiments in Swiss German dialect identification, in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, New Mexico, USA (Association for Computational Linguistics, 2018b), pp. 254–262. https://www.aclweb.org/anthology/W18-3929

  • T. Jauhiainen, H. Jauhiainen, K. Lindén, HeLI-OTS, off-the-shelf language identifier for text, in Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France (European Language Resources Association, 2022a), pp. 3912–3922. https://aclanthology.org/2022.lrec-1.416

  • T. Jauhiainen, H. Jauhiainen, K. Lindén, Italian language and dialect identification and regional French variety detection using adaptive naive Bayes, in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022b), pp. 119–129. https://aclanthology.org/2022.vardial-1.13

  • T. Jauhiainen, H. Jauhiainen, K. Lindén, Iterative language model adaptation for Indo-Aryan language identification, in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, New Mexico, USA (Association for Computational Linguistics, 2018a), pp. 66–75. https://www.aclweb.org/anthology/W18-3907

  • T. Jauhiainen, H. Jauhiainen, K. Lindén, Naive Bayes-based experiments in Romanian dialect identification, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021a), pp. 76–83. https://www.aclweb.org/anthology/2021.vardial-1.9

  • T. Jauhiainen, H. Jauhiainen, N. Partanen, K. Lindén, Uralic language identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain (Online) (International Committee on Computational Linguistics (ICCL), 2020c), pp. 173–185. https://www.aclweb.org/anthology/2020.vardial-1.16

  • T. Jauhiainen, K. Lindén, H. Jauhiainen, Evaluating HeLI with non-linear mappings, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain (Association for Computational Linguistics, 2017b), pp. 102–108. https://doi.org/10.18653/v1/W17-1212. https://www.aclweb.org/anthology/W17-1212

  • T. Jauhiainen, K. Lindén, H. Jauhiainen, Evaluation of language identification methods using 285 languages, in Proceedings of the 21st Nordic Conference on Computational Linguistics, Gothenburg, Sweden (Association for Computational Linguistics, 2017a), pp. 183–191. https://www.aclweb.org/anthology/W17-0221

  • T. Jauhiainen, K. Lindén, H. Jauhiainen, HeLI, a word-based backoff method for language identification, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 153–162. https://www.aclweb.org/anthology/W16-4820

  • T. Jauhiainen, K. Lindén, H. Jauhiainen, Language set identification in noisy synthetic multilingual documents, in Proceedings of the Computational Linguistics and Intelligent Text Processing 16th International Conference (CICLing 2015), Cairo, Egypt (2015c), pp. 633–643

  • T. Jauhiainen, M. Lui, M. Zampieri, T. Baldwin, K. Lindén, Automatic language identification in texts: a survey. J. Artif. Intell. Res. 65, 675–782 (2019e). ISSN 1076-9757. https://doi.org/10.1613/jair.1.11675

  • T. Jauhiainen, T. Ranasinghe, M. Zampieri, Comparing approaches to Dravidian language identification, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021c), pp. 120–127. https://www.aclweb.org/anthology/2021.vardial-1.14

  • T. Jo, Neural text categorizer for exclusive text categorization. J. Inf. Process. Syst. 4, 77–86 (2008)

  • M.I. Jordan, Serial Order: A Parallel Distributed Processing Approach. Technical report, Institute for Cognitive Science, University of California, San Diego (1986)

  • A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain (Association for Computational Linguistics, 2017), pp. 427–431. https://www.aclweb.org/anthology/E17-2068

  • D. Jurgens, Y. Tsvetkov, D. Jurafsky, Incorporating dialectal variability for socially equitable language identification, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada (Association for Computational Linguistics, 2017), pp. 51–57. https://doi.org/10.18653/v1/P17-2009. https://aclanthology.org/P17-2009

  • S. Kadri, A. Moussaoui, An effective method to recognize the language of a text in a collection of multilingual documents, in Proceedings of the International Conference on Electronics, Computer and Computation (ICECCO 2013), Ankara, Turkey (2013), pp. 208–211

  • N.J. Kalita, A.G. Agarwala, J. Das, Word level language identification on code-mixed English-Bodo text, in IOP Conference Series: Materials Science and Engineering, vol. 1020 (IOP Publishing, 2021), p. 012027

  • C.M. Kastner, G.A. Covington, A.A. Levine, J.W. Lockwood, Hail: a hardware-accelerated algorithm for language identification, in Proceedings of the 2005 International Conference on Field Programmable Logic and Applications (FPL), ed. by T. Rissa, S. Wilton, P. Leong, Tampere, Finland (2005), pp. 499–504

  • T. Kerwin, Classification of Natural Language Based on Character Frequency. (Ohio Supercomputer Center, 2006)

  • E. Ketzan, N. Werner, ‘Entrez!’ she called: evaluating language identification tools in English literary texts, in Proceedings of the Computational Humanities Research Conference 2022 (CHR 2022), Antwerp, Belgium (2022), pp. 366–373

  • L. Kevers, CoSwID, a code switching identification method suitable for under-resourced languages, in Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, Marseille, France (European Language Resources Association, 2022), pp. 112–121. https://aclanthology.org/2022.sigul-1.15

  • S. Kim, J. Park, Automatic Detection of Character Encoding and Language. Technical report, Stanford University (2007)

  • B.P. King, Practical Natural Language Processing for Low-Resource Languages. Ph.D. thesis, University of Michigan (2015)

  • L. King, S. Kübler, W. Hooper, Word-level language identification in The Chymistry of Isaac Newton. Digit. Scholarsh. Humanit. 30(4), 532–540 (2015)

  • T. Kocmi, O. Bojar, LanideNN: Multilingual language identification on character window, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain (Association for Computational Linguistics, 2017), pp. 927–936. https://aclanthology.org/E17-1087

  • D. Kosmajac, Author and Language Profiling of Short Texts. Ph.D. thesis, Dalhousie University (2020)

  • C. Kruengkrai, P. Srichaivattana, V. Sornlertlamvanich, H. Isahara, Language identification based on string kernels, in Proceedings of the 5th International Symposium on Communications and Information Technologies (ISCIT-2005), vol. 2, Beijing, China (2005), pp. 896–899

  • S. Kulikowski, Language Identification of Short Texts (The University of West Florida, Newsgroup article, Educational Research and Development Center, 1991)

  • A. Lambebo Tonja, M. Gemeda Yigezu, O. Kolesnikova, M. Shahiki Tash, G. Sidorov, A. Gelbukh, Transformer-based model for word level language identification in code-mixed Kannada-English texts (2022). arXiv:2211.14459

  • S. Leidig, Single and Combined Features for the Detection of Anglicisms in German and Afrikaans. Bachelor’s Thesis, Karlsruhe Institute of Technology (2014)

  • Y. Li, T. Baldwin, T. Cohn, What’s in a domain? Learning domain-robust text representations using adversarial training, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana (Association for Computational Linguistics, 2018), pp. 474–479. https://doi.org/10.18653/v1/N18-2076. https://aclanthology.org/N18-2076

  • N. Ljubešić, D. Kranjčić, Discriminating between VERY similar languages among Twitter users, in Proceedings of the 9th Language Technologies Conference, Ljubljana, Slovenia (2014), pp. 90–94

  • A.F. Llitjós, Improving Pronunciation Accuracy of Proper Names with Language Origin Classes. Master’s thesis, Carnegie Mellon University, Pittsburgh, PA, USA (2001)

  • M. Lui, T. Baldwin, Accurate language identification of Twitter messages, in Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), Gothenburg, Sweden (Association for Computational Linguistics, 2014), pp. 17–25. https://doi.org/10.3115/v1/W14-1303. https://aclanthology.org/W14-1303

  • M. Lui, T. Baldwin, langid.py: an off-the-shelf language identification tool, in Proceedings of the ACL 2012 System Demonstrations, Jeju Island, Korea (Association for Computational Linguistics, 2012), pp. 25–30. https://aclanthology.org/P12-3005

  • M. Lui, J.H. Lau, T. Baldwin, Automatic detection and language identification of multilingual documents. Trans. Assoc. Comput. Linguist. 2, 27–40 (2014)

  • S. MacNamara, P. Cunningham, J. Byrne, Neural networks for language identification: a comparative study. Inf. Process. Manag. 34(4), 395–403 (1998)

  • M. Majliš, Large Multilingual Corpus. Master’s thesis, Charles University in Prague, Prague (2011)

  • M. Majliš, Yet another language identifier, in Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France (Association for Computational Linguistics, 2012), pp. 46–54. https://aclanthology.org/E12-3006

  • M. Majliš, Z. Žabokrtský, Language richness of the web, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey (European Language Resources Association (ELRA), 2012), pp. 2927–2934. http://www.lrec-conf.org/proceedings/lrec2012/pdf/267_Paper.pdf

  • S. Malmasi, M. Dras, Automatic language identification for Persian and Dari texts, in Proceedings of the 14th Conference of the Pacific Association for Computational Linguistics, PACLING’15, Bali, Indonesia (2015a), pp. 59–64

  • S. Malmasi, M. Dras, Feature hashing for language and dialect identification, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada (Association for Computational Linguistics, 2017), pp. 399–403. https://doi.org/10.18653/v1/P17-2063. https://aclanthology.org/P17-2063

  • S. Malmasi, M. Dras, Language identification using classifier ensembles, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria (Association for Computational Linguistics, 2015b), pp. 35–43. https://aclanthology.org/W15-5407

  • S. Malmasi, M. Zampieri, Arabic dialect identification in speech transcripts, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 106–113. https://aclanthology.org/W16-4814

  • S. Malmasi, M. Zampieri, Arabic dialect identification using iVectors and ASR transcripts, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain (Association for Computational Linguistics, 2017b), pp. 178–183. https://doi.org/10.18653/v1/W17-1222. https://aclanthology.org/W17-1222

  • S. Malmasi, M. Zampieri, German dialect identification in interview transcriptions, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain (Association for Computational Linguistics, 2017a), pp. 164–169. https://doi.org/10.18653/v1/W17-1220. https://aclanthology.org/W17-1220

  • S. Malmasi, M. Zampieri, N. Ljubešić, P. Nakov, A. Ali, J. Tiedemann, Discriminating between similar languages and Arabic dialect identification: a report on the third DSL shared task, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 1–14. https://www.aclweb.org/anthology/W16-4801

  • P. Martadinata, B.D. Trisedya, H.M. Manurung, M. Adriani, Building Indonesian local language detection tools using Wikipedia data, in Worldwide Language Service Infrastructure, ed. by Y. Murakami, D. Lin (Springer, 2016), pp. 113–123

  • M. Martinc, I. Škrjanec, K. Zupan, S. Pollak, PAN, author profiling - gender and language variety prediction - notebook for PAN at CLEF, in 2017 Working Notes Papers of CLEF 2017 Evaluation Labs and Workshop, Dublin, Ireland (2017)

  • P. Mathur, A. Misra, E. Budur, LIDE: Language Identification from Text Documents (2017). arXiv:1701.03682

  • A.K. McCallum, MALLET: a machine learning for language toolkit (2002). http://mallet.cs.umass.edu

  • P. McNamee, Language identification: a solved problem suitable for undergraduate instruction. J. Comput. Sci. Coll. 20(3), 94–101 (2005)

  • M. Medvedeva, M. Kroon, B. Plank, When sparse traditional models outperform dense neural networks: the curious case of discriminating between similar languages, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain (Association for Computational Linguistics, 2017), pp. 156–163. https://doi.org/10.18653/v1/W17-1219. https://aclanthology.org/W17-1219

  • S. Mehta, T. Jain, N. Aggarwal, Multilingual short text analysis of Twitter using random forest approach, in Knowledge Graphs and Semantic Web, ed. by B. Villazón-Terrazas, F. Ortiz-Rodríguez, S. Tiwari, A. Goyal, M. Jabbar (Springer International Publishing, Cham, 2021), pp. 84–92. ISBN 978-3-030-91305-2

  • I. Mendoza, J. Mendelsohn, Exploring Techniques in Distinguishing Similar Languages. Stanford course project (2017)

  • A. Miletic, Y. Scherrer, OcWikiDisc: a corpus of Wikipedia talk pages in Occitan, in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022), pp. 70–79. https://aclanthology.org/2022.vardial-1.8

  • A. Moodley, Language Identification With Decision Trees: Identification Of Individual Words In The South African Languages. Bachelor’s Thesis, University of South Africa (2016)

  • A. Mukherjee, A. Ravi, K. Datta, Mixed-script query labelling using supervised learning and ad hoc retrieval using sub word indexing, in FIRE ’14 Proceedings of the Forum for Information Retrieval Evaluation, Bangalore, India (2014), pp. 86–90

  • S. Mustonen, Multiple discriminant analysis in linguistic problems. Stat. Methods Linguist. 4, 37–44 (1965)

  • H. Ney, U. Essen, R. Kneser, On structuring probabilistic dependences in stochastic language modelling. Comput. Speech Lang. 8(1), 1–38 (1994)

  • A.Y. Ng, M.I. Jordan, On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes, in Advances in Neural Information Processing Systems 15 (NIPS 2002), ed. by S. Becker, S. Thrun, K. Obermayer (Vancouver, British Columbia, Canada, 2002), pp. 841–848

  • NLLB Team, M.R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G.M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K.R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N.F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, J. Wang, No language left behind: scaling human-centered machine translation (2022). https://arxiv.org/abs/2207.04672

  • B. Okgetheng, E.A. Budu, Word-based Bantu language identification using naïve Bayes, in 2022 IST-Africa Conference (IST-Africa) (2022), pp. 1–7. https://doi.org/10.23919/IST-Africa56635.2022.9845618

  • G.H. Paetzold, M. Zampieri, Experiments in cuneiform language identification, in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, Ann Arbor, Michigan (Association for Computational Linguistics, 2019), pp. 209–213. https://doi.org/10.18653/v1/W19-1423. https://www.aclweb.org/anthology/W19-1423

  • S. Patel, V. Desai, LIGA and syllabification approach for language identification and back transliteration: a shared task report by DA-IICT, in FIRE ’14 Proceedings of the Forum for Information Retrieval Evaluation, Bangalore, India (2014), pp. 43–47

  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(Oct), 2825–2830 (2011)

  • F. Peng, D. Schuurmans, Combining naive Bayes and n-gram language models for text classification, in Proceedings of the 25th European Conference on IR Research, Advances in Information Retrieval (ECIR 2003), Pisa, Italy (Springer, Berlin, Heidelberg, 2003), pp. 335–350. ISBN 978-3-540-36618-8. https://doi.org/10.1007/3-540-36618-0_24

  • J. Porta, J.-L. Sancho, Using maximum entropy models to discriminate between similar languages and varieties, in Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, Dublin, Ireland (Association for Computational Linguistics and Dublin City University, 2014), pp. 120–128. https://doi.org/10.3115/v1/W14-5314. https://aclanthology.org/W14-5314

  • A. Poutsma, Applying Monte Carlo techniques to language identification. Lang. Comput. 45(1), 179–189 (2002)

  • J.M. Prager, Linguini: language identification for multilingual documents, in Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences (HICSS-32), Maui, USA (1999)

  • Y. Qu, G. Grefenstette, Finding ideographic representations of Japanese names written in Latin script via language identification and corpus validation, in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain (2004), pp. 183–190. https://doi.org/10.3115/1218955.1218979. https://aclanthology.org/P04-1024

  • F. Rangel, P. Rosso, M. Potthast, B. Stein, Overview of the 5th Author Profiling Task at PAN 2017: gender and language variety identification in Twitter, in Working Notes Papers of CLEF 2017 Evaluation Labs and Workshop, ed. by L. Cappellato, N. Ferro, L. Goeuriot, T. Mandl, Dublin, Ireland (2017). CEUR-WS.org. http://ceur-ws.org/Vol-1866/

  • M.D. Rau, Language Identification by Statistical Analysis. Master’s thesis, Naval Postgraduate School, Monterey (1974)

  • C. Sabty, I. Mesabah, Ö. Çetinoğlu, S. Abdennadher, Language identification of intra-word code-switching for Arabic-English. Array 12, 100104 (2021). ISSN 2590-0056. https://doi.org/10.1016/j.array.2021.100104. https://www.sciencedirect.com/science/article/pii/S2590005621000473

  • Y. Samih, S. Maharjan, M. Attia, L. Kallmeyer, T. Solorio, Multilingual code-switching identification via LSTM recurrent neural networks, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, Austin, Texas (Association for Computational Linguistics, 2016), pp. 50–59. https://doi.org/10.18653/v1/W16-5806. https://aclanthology.org/W16-5806

  • Y. Samih, W. Maier, Detecting code-switching in Moroccan Arabic social media, in Proceedings of the 4th International Workshop on Natural Language Processing for Social Media (SocialNLP 2016 IJCAI), New York City, USA (2016)

  • N. Sarma, Automatic Language Identification in Online Multilingual Conversations. Ph.D. thesis, Indian Institute of Technology Guwahati (2021)

  • N. Sarma, R. Sanasam Singh, D. Goswami, SwitchNet: learning to switch for word-level language identification in code-mixed social media text. Nat. Lang. Eng. 28(3), 337–359 (2022). https://doi.org/10.1017/S1351324921000115

  • F. Sebastiani, Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 34(1), 1–47 (2002)

  • A. Selamat, C.-C. Ng, Y. Mikami, Arabic script web documents language identification using decision tree-ARTMAP model, in Proceedings of the International Conference on Convergence Information Technology (ICCIT 2007), Gyeongju, Korea (IEEE, 2007), pp. 717–722

  • R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2016), pp. 1715–1725

  • K. Shaffer, Language clustering for multilingual named entity recognition, in Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic (Association for Computational Linguistics, 2021), pp. 40–45. https://aclanthology.org/2021.findings-emnlp.4

  • P. Shrestha, Incremental N-gram approach for language identification in code-switched text, in Proceedings of the First Workshop on Computational Approaches to Code Switching, Doha, Qatar (Association for Computational Linguistics, 2014), pp. 133–138. https://doi.org/10.3115/v1/W14-3916. https://aclanthology.org/W14-3916

  • P. Sibun, J.C. Reynar, Language identification: examining the issues, in Proceedings of the 5th Annual Symposium on Document Analysis and Information Retrieval (SDAIR-96), Las Vegas, USA (1996), pp. 125–135

  • A.K. Singh, Modeling and Application of Linguistic Similarity. Ph.D. thesis, International Institute of Information Technology, Hyderabad (2010)

  • A.K. Singh, Study of some distance measures for language and encoding identification, in Proceedings of the Workshop on Linguistic Distances, Sydney, Australia (2006), pp. 63–72

  • X. Song, A. Salcianu, Y. Song, D. Dopson, D. Zhou, Fast wordpiece tokenization, in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (2021), pp. 2089–2103

  • C. Souter, G. Churcher, J. Hayes, J. Hughes, S. Johnson, Natural language identification using corpus-based models. Hermes J. Linguist. 13, 183–203 (1994)

  • A. Stensby, B.J. Oommen, O.-C. Granmo, Language detection and tracking in multilingual documents using weak estimators, in Proceedings of the Joint IAPR International Workshop Structural, Syntactic, and Statistical Pattern Recognition (SSPR & SPR 2010), Cesme, Izmir, Turkey (2010), pp. 600–609

  • A. Stolcke, SRILM - an extensible language modeling toolkit, in Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-2002), Denver, Colorado, USA (2002), pp. 901–904

  • H. Takçı, T. Güngör, A high performance centroid-based classification approach for language identification. Pattern Recognit. Lett. 33(16), 2077–2084 (2012)

  • L. Tan, M. Zampieri, N. Ljubešić, J. Tiedemann, Merging comparable data sources for the discrimination of similar languages: the DSL corpus collection, in Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC), Reykjavik, Iceland (2014)

  • W.J. Teahan, Text classification and segmentation using minimum cross-entropy, in Proceedings of the 6th International Conference Recherche d’Information Assistee par Ordinateur (RIAO’00), Paris, France (2000), pp. 943–961

  • S. Thara, P. Poornachandran, Transformer based language identification for Malayalam-English code-mixed text. IEEE Access 9, 118837–118850 (2021). https://doi.org/10.1109/ACCESS.2021.3104106

  • J. Tian, J. Häkkinen, S. Riis, K.J. Jensen, On text-based language identification for multilingual speech recognition systems, in Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP2002), Denver, Colorado, USA (2002)

  • J. Tian, J. Suontausta, Scalable neural network based language identification from written text, in Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), vol. 1, Hong Kong (2003), pp. 48–51

  • M. Toftrup, S. Asger Sørensen, M.R. Ciosici, I. Assent, A reproduction of Apple’s bi-directional LSTM models for language identification in short strings, in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop (Association for Computational Linguistics, 2021), pp. 36–42. https://www.aclweb.org/anthology/2021.eacl-srw.6

  • E. Tromp, Multilingual Sentiment Analysis on Social Media. Master’s thesis, Eindhoven University of Technology, Eindhoven (2011)

  • E. Tromp, M. Pechenizkiy, Graph-based N-gram language identification on short texts, in Proceedings of the 20th Annual Belgian Dutch Conference on Machine Learning (Benelearn 2011), The Hague, Netherlands (2011), pp. 27–34

  • D. Tudoreanu, DTeam @ VarDial 2019: Ensemble based on skip-gram and triplet loss neural networks for Moldavian vs. Romanian cross-dialect topic identification, in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, Ann Arbor, Michigan (Association for Computational Linguistics, 2019), pp. 202–208. https://doi.org/10.18653/v1/W19-1422. https://aclanthology.org/W19-1422

  • P. v. Cann, Dialect Identification on Twitter: A Research About the Detection of the Limburgian Dialect from Twitter messages. Master’s thesis, University of Tilburg (2015)

  • T. Vatanen, J.J. Väyrynen, S. Virpioja, Language identification of short text segments with n-gram models, in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta (European Language Resources Association (ELRA), 2010). http://www.lrec-conf.org/proceedings/lrec2010/pdf/279_Paper.pdf

  • J. Vogel, D. Tresner-Kirsch, Robust language identification in short, noisy texts: improvements to LIGA, in Proceedings of the 3rd International Workshop on Mining Ubiquitous and Social Environments (MUSE), ed. by M. Atzmueller, H. Andreas, Bristol, UK (2012), pp. 43–50

  • A. Wadhawan, Dialect identification in nuanced Arabic tweets using farasa segmentation and AraBERT, in Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine (Virtual) (Association for Computational Linguistics, 2021), pp. 291–295. https://aclanthology.org/2021.wanlp-1.35

  • M. Watson, C. Qian, J. Bischof, F. Chollet, et al., KerasNLP (2022). https://github.com/keras-team/keras-nlp

  • I. Weber, Language identification in a highly unbalanced dataset. Master’s thesis, Stellenbosch University (2022)

  • D. Widdows, C. Brew, Language identification with a reciprocal rank classifier (2021). arXiv:2109.09862

  • I.H. Witten, E. Frank, M.A. Hall, C.J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, 4th edn. (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2016). ISBN 0128042915

  • D.H. Wolpert, Stacked generalization. Neural Netw. 5(2), 241–259 (1992)

  • M. Yasir, L. Chen, A. Khatoon, M.A. Malik, F. Abid, Mixed script identification using automated DNN hyperparameter optimization. Comput. Intell. Neurosci. (2021)

  • J.-L. You, Y.-N. Chen, M. Chu, F.K. Soong, J.-L. Wang, Identifying language origin of named entity with multiple information sources. IEEE Trans. Audio Speech Lang. Process. 16(6), 1077–1086 (2008)

  • G.-E. Zaharia, A.-M. Avram, D.-C. Cercel, T. Rebedea, Dialect identification through adversarial learning and knowledge distillation on Romanian BERT, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021), pp. 113–119. https://aclanthology.org/2021.vardial-1.13

  • J.D. Zamora, A.F. Bruzón, R.O. Bueno, Tweets language identification using feature weighting, in Proceedings of the Tweet Language Identification Workshop 2014 co-located with 30th Conference of the Spanish Society for Natural Language Processing (SEPLN 2014), Girona, Spain (2014), pp. 30–34

  • M. Zampieri, B. G. Gebre, H. Costa, J. van Genabith, Comparing approaches to the identification of similar languages, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria (Association for Computational Linguistics, 2015a), pp. 66–72. https://aclanthology.org/W15-5411

Author information

Corresponding author

Correspondence to Tommi Jauhiainen.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Jauhiainen, T., Zampieri, M., Baldwin, T., Lindén, K. (2024). Features and Methods. In: Automatic Language Identification in Texts. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-45822-4_2

  • DOI: https://doi.org/10.1007/978-3-031-45822-4_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45821-7

  • Online ISBN: 978-3-031-45822-4

  • eBook Packages: Synthesis Collection of Technology (R0)
