Abstract
In addition to the features and methods used in language identification (LI), this chapter introduces the notation devised by Jauhiainen et al. (2019e), which is used throughout this book to describe LI methods. For easier reference, we include the complete description of the notation in the first section of this chapter. The notation may be difficult to digest without concrete examples, but it is introduced gradually in the descriptions of features and methods in Sects. 2.2 and 2.3. We have translated the notation of the original papers into our own to make it easier to see the similarities and differences between the LI methods presented in the literature.
Notes
- 1.
- 2.
- 3.
- 4.
The semi-random order of the characters is as follows: qazæwsxøedcårfvötgbäyhnujmikolp.
- 5.
- 6.
- 7.
- 8.
The presentation is far too long to be re-presented here.
- 9.
Yet Another Language Identifier.
- 10.
To the best of our knowledge, the multivariate Bernoulli version of NB has never been used for LI. See Giwa (2016) for a possible explanation.
- 11.
Modern Standard Arabic (MSA) and Egyptian Arabic.
- 12.
The update dates in the footnotes were checked in March 2023 to indicate how active the different projects were at the time of writing.
- 13.
http://scikit-learn.org/stable/ (last updated March 2023).
- 14.
https://www.cs.waikato.ac.nz/ml/weka/ (last updated March 2023).
- 15.
- 16.
http://www.nltk.org (last updated January 2023).
- 17.
- 18.
http://mallet.cs.umass.edu (last updated December 2022).
- 19.
http://www.speech.sri.com/projects/srilm/download.html (last updated September 2022).
- 20.
https://www.csie.ntu.edu.tw/~cjlin/liblinear/ (last updated February 2023).
- 21.
http://liblinear.bwaldvogel.de/ (last updated May 2022).
- 22.
https://www.csie.ntu.edu.tw/~cjlin/libsvm/ (last updated February 2023).
- 23.
https://www.tensorflow.org (last updated March 2023).
- 24.
- 25.
https://pytorch.org (last updated March 2023).
References
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/. Software available from tensorflow.org
K. Abainia, S. Ouamour, H. Sayoud, Effective language identification of forum texts based on statistical approaches. Inf. Process. Manag. 52, 491–512 (2016)
J. Ács, L. Grad-Gyenge, T.B. Rodrigues de Rezende Oliveira, A two-level classifier for discriminating similar languages, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria (Association for Computational Linguistics, 2015), pp. 73–77. https://aclanthology.org/W15-5412
I. Adebara, A. Elmadany, M. Abdul-Mageed, A. Inciarte, AfroLID: a neural language identification tool for African languages, in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates (Association for Computational Linguistics, 2022), pp. 1958–1981. https://aclanthology.org/2022.emnlp-main.128
W. Adouane, N. Semmar, R. Johansson, V. Bobicev, Automatic detection of Arabicized Berber and Arabic varieties, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 63–72. https://aclanthology.org/W16-4809
N. Aepli, A. Anastasopoulos, A.-G. Chifu, W. Domingues, F. Faisal, M. Gaman, R.T. Ionescu, Y. Scherrer, Findings of the VarDial evaluation campaign 2022, in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022), pp. 1–13. https://aclanthology.org/2022.vardial-1.1
G.I. Ahmad, J. Singla, (LISACMT) Language identification and sentiment analysis of English-Urdu ‘code-mixed’ text using LSTM, in 2022 International Conference on Inventive Computation Technologies (ICICT) (2022), pp. 430–435. https://doi.org/10.1109/ICICT54344.2022.9850505
B. Ahmed, S.-H. Cha, C. Tappert, Language identification from text using N-gram based cumulative frequency addition, in Proceedings of Student/Faculty Research Day (CSIS, Pace University, New York, USA, 2004), pp. 12.1–12.8
B. AlKhamissi, M. Gabr, M. ElNokrashy, K. Essam, Adapting MARBERT for improved Arabic dialect identification: Submission to the NADI 2021 shared task, in Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine (Virtual) (Association for Computational Linguistics, 2021), pp. 260–264. https://aclanthology.org/2021.wanlp-1.29
T. Alqurashi, Applying a character-level model to a short Arabic dialect sentence: a Saudi dialect as a case study. Appl. Sci. 12(23) (2022). ISSN 2076-3417. https://doi.org/10.3390/app122312435. https://www.mdpi.com/2076-3417/12/23/12435
M.Z. Ansari, T. Ahmad, M.M.S. Beg, A. Ikram, A Simple and Efficient Probabilistic Language model for Code-Mixed Text (2021a). arXiv:2106.15102
M.Z. Ansari, M.M.S. Beg, T. Ahmad, M.J. Khan, G. Wasim, Language Identification of Hindi-English tweets using code-mixed BERT (2021b). arXiv:2107.01202
A. Avenberg, Automatic language identification of short texts. Master’s thesis, Uppsala University (2020)
A. Babhulgaonkar, S. Sonavane, Language identification for multilingual machine translation, in 2020 International Conference on Communication and Signal Processing (ICCSP) (2020), pp. 401–405. https://doi.org/10.1109/ICCSP48568.2020.9182184
A.S. Babu, P. Kumar, Comparing neural network approach with N-gram approach for text categorization. Int. J. Comput. Sci. Eng. 2(1), 80–83 (2010)
I. Balažević, M. Braun, K.-R. Müller, Language Detection For Short Text Messages In Social Media (2016). arXiv:1608.08515
T. Baldwin, M. Lui, Language identification: the long and the short of the matter, in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, USA (Association for Computational Linguistics, 2010b), pp. 229–237. https://aclanthology.org/N10-1027
E.O. Batchelder, A Learning Experience: Training an Artificial Neural Network to Discriminate Languages. Technical report (1992)
G. Bernier-Colborne, C. Goutte, Challenges in neural language identification: NRC at VarDial 2020, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain (International Committee on Computational Linguistics (ICCL), 2020), pp. 273–282. https://www.aclweb.org/anthology/2020.vardial-1.26
G. Bernier-Colborne, C. Goutte, S. Léger, Improving cuneiform language identification with BERT, in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, Ann Arbor, Michigan (Association for Computational Linguistics, 2019), pp. 17–25. https://doi.org/10.18653/v1/W19-1402. https://www.aclweb.org/anthology/W19-1402
G. Bernier-Colborne, S. Leger, C. Goutte, N-gram and neural models for Uralic language identification: NRC at VarDial 2021, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021), pp. 128–134. https://www.aclweb.org/anthology/2021.vardial-1.15
G. Bernier-Colborne, S. Leger, C. Goutte, Transfer learning improves French cross-domain dialect identification: NRC @ VarDial 2022, in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022), pp. 109–118. https://aclanthology.org/2022.vardial-1.12
Y. Bestgen, Improving the character ngram model for the DSL task with BM25 weighting and less frequently used feature sets, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain (Association for Computational Linguistics, 2017), pp. 115–123. https://doi.org/10.18653/v1/W17-1214. https://aclanthology.org/W17-1214
Y. Bestgen, Optimizing a supervised classifier for a difficult language identification problem, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021), pp. 96–101. https://www.aclweb.org/anthology/2021.vardial-1.11
S.N. Bhattu, V. Ravi, Language identification in mixed script social media text, in Working Notes of the Forum for Information Retrieval Evaluation (FIRE), Gandhinagar, India (2015), pp. 39–31
S. Bird, E. Klein, E. Loper, Natural Language Processing With Python: Analyzing Text With the Natural Language Toolkit (O’Reilly Media, Inc., 2009)
J. Bjerva, Byte-based Language Identification with Deep Convolutional Networks, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 119–125. https://aclanthology.org/W16-4816
A. Bosca, L. Dini, Language identification strategies for cross language information retrieval, in Working Notes for LogCLEF 2010: The CLEF Multilingual Logfile Analysis Track, Padua, Italy (2010)
B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in Proceedings of the Fifth Annual Workshop on Computational Learning Theory COLT’92, Pittsburgh, USA (1992), pp. 144–152
G.R. Botha, Text-Based Language Identification for The South African Languages. Master’s thesis, University of Pretoria, Hatfield, Pretoria, South Africa (2008)
G.R. Botha, E. Barnard, Factors that affect the accuracy of text-based language identification, in Proceedings of the Eighteenth Annual Symposium of the Pattern Recognition Association of South Africa, ed. by J.R. Tapamo, F. Nicolls, Pietermaritzburg, South Africa (2007), pp. 7–12
G.R. Botha, E. Barnard, Factors that affect the accuracy of text-based language identification. Comput. Speech Lang. 26(5), 307–320 (2012)
G. Botha, V. Zimu, E. Barnard, Text-based language identification for South African languages. Trans. South African Inst. Electr. Eng. 98(4), 141–148 (2007)
L. Breiman, Random forests. Mach. Learn. 45(1), 5–32 (2001)
R. Brown, Non-linear mapping for improved identification of 1300+ languages, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar (Association for Computational Linguistics, 2014), pp. 627–632. https://doi.org/10.3115/v1/D14-1069, https://aclanthology.org/D14-1069
R.D. Brown, Selecting and weighting n-grams to identify 1100 languages, in Proceedings of the 16th International Conference on Text, Speech and Dialogue (TSD 2013), Plzeň, Czech Republic (2013), pp. 475–483
R.D. Brown, Finding and identifying text in 900+ languages. Digit. Invest. 9, S34–S43 (2012)
Ç. Çöltekin, Dialect identification under domain shift: Experiments with discriminating Romanian and Moldavian, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain (Online) (International Committee on Computational Linguistics (ICCL), 2020), pp. 186–192. https://www.aclweb.org/anthology/2020.vardial-1.17
Ç. Çöltekin, T. Rama, Discriminating similar languages with linear SVMs and neural networks, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 15–24. https://aclanthology.org/W16-4802
Ç. Çöltekin, T. Rama, V. Blaschke, Tübingen-Oslo team at the VarDial 2018 evaluation campaign: an analysis of n-gram features in language variety identification, in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, New Mexico, USA (Association for Computational Linguistics, 2018), pp. 55–65. https://aclanthology.org/W18-3906
G. Camposampiero, Q.A. Nguyen, F. Di Stefano, The curious case of logistic regression for Italian languages and dialects identification, in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022), pp. 86–98. https://aclanthology.org/2022.vardial-1.10
S. Carter, W. Weerkamp, M. Tsagkias, Microblog language identification: overcoming the limitations of short, unedited and idiomatic text. Lang. Res. Eval. 47(1), 195–215 (2013)
D. Castro, E. Souza, A.L.I. de Oliveira, Discriminating between Brazilian and European Portuguese National varieties on Twitter texts, in Proceedings of the 5th Brazilian Conference on Intelligent Systems (BRACIS 2016), Recife, Pernambuco, Brazil (IEEE, 2016), pp. 265–270
D.W. Castro, E. Souza, D. Vitório, D. Santos, A.L.I. Oliveira, Smoothed N-gram based models for tweet language identification: a case study of the Brazilian and European Portuguese National varieties. Appl. Soft Comput. 61, 1160–1172 (2017)
W.B. Cavnar, J.M. Trenkle, N-Gram-based text categorization, in Proceedings of SDAIR-94, Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, USA (1994), pp. 161–175
J. Cazamias, C. Dixit, M. Marek, Large-Scale Language Classification - Writing a Detector for 200 Languages on Twitter. Stanford course report (2015)
A. Ceolin, Comparing the performance of CNNs and shallow models for language identification, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021), pp. 102–112. https://www.aclweb.org/anthology/2021.vardial-1.12
A. Ceolin, Neural networks for cross-domain language identification. phlyers @Vardial 2022, in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022), pp. 99–108. https://aclanthology.org/2022.vardial-1.11
A. Ceolin, H. Zhang, Discriminating between standard Romanian and Moldavian tweets using filtered character ngrams, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain (International Committee on Computational Linguistics (ICCL), 2020), pp. 265–272. https://www.aclweb.org/anthology/2020.vardial-1.25
B.R. Chakravarthi, M. Gaman, R.T. Ionescu, H. Jauhiainen, T. Jauhiainen, K. Lindén, N. Ljubešić, N. Partanen, R. Priyadharshini, C. Purschke, E. Rajagopal, Y. Scherrer, M. Zampieri, Findings of the VarDial evaluation campaign 2021, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021), pp. 1–11. https://www.aclweb.org/anthology/2021.vardial-1.1
J.C. Chang, C.-C. Lin, Recurrent-neural-network for Language Detection on Twitter Code-Switching Corpus (2014). arXiv:1412.4314
C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines. ACM Trans. Intel. Syst. Technol. (TIST) 2(3), 27 (2011)
S.F. Chen, B. Maison, Using place name data to train language identification models, in 8th European Conference on Speech Communication and Technology EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland (2003), pp. 1349–1352
S.F. Chen, J. Goodman, An empirical study of smoothing techniques for language modeling. Comput. Speech Lang. 13(4), 359–394 (1999)
K. Church, Stress assignment in letter to sound rules for speech synthesis, in 23rd Annual Meeting of the Association for Computational Linguistics, Chicago, Illinois, USA (Association for Computational Linguistics, 1985), pp. 246–253. https://doi.org/10.3115/981210.981240, https://aclanthology.org/P85-1030
K. Darwish, H. Sajjad, H. Mubarak, Verifiably effective Arabic dialect identification, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar (Association for Computational Linguistics, 2014), pp. 1465–1468. https://doi.org/10.3115/v1/D14-1154, https://aclanthology.org/D14-1154
J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota (Association for Computational Linguistics, 2019), pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423
P. Diaconis, R.L. Graham, Spearman’s footrule as a measure of disarray. J. Roy. Stat. Soc. Ser. B (Methodol.) 39(2), 262–268 (1977)
N. Dongen, Analysis and Prediction of Dutch-English Code-switching in Dutch Social Media Messages. Master’s thesis, Universiteit van Amsterdam, Amsterdam, Netherlands (2017)
Y. Doval, D. Vilares, J. Vilares, Automatic language identification in Twitter: adapting state-of-the-art identifiers to the Iberian context, in Proceedings of the Tweet Language Identification Workshop 2014 co-located with 30th Conference of the Spanish Society for Natural Language Processing (SEPLN 2014), Girona, Spain (2014), pp. 39–43
S. Dowlagar, R. Mamidi, A pre-trained transformer and CNN model with joint language ID and part-of-speech tagging for code-mixed social-media text, in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), Held Online (INCOMA Ltd, 2021), pp. 367–374. https://aclanthology.org/2021.ranlp-main.42
J. Dunn, W. Nijhof, Language identification for Austronesian languages, in Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France (European Language Resources Association, 2022), pp. 6530–6539. https://aclanthology.org/2022.lrec-1.701
T. Dunning, Statistical Identification of Language. Technical Report MCCS 940-273, Computing Research Laboratory, New Mexico State University (1994)
A. Dutta, Word-level language identification using subword embeddings for code-mixed Bangla-English social media data, in Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages within the 13th Language Resources and Evaluation Conference, Marseille, France (European Language Resources Association, 2022), pp. 76–82. https://aclanthology.org/2022.dclrl-1.10
A. El Mekki, A. El Mahdaouy, K. Essefar, N. El Mamoun, I. Berrada, A. Khoumsi, BERT-based multi-task model for country and province level MSA and dialectal Arabic identification, in Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine (Virtual) (Association for Computational Linguistics, 2021), pp. 271–275. https://www.aclweb.org/anthology/2021.wanlp-1.31
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9(Aug), 1871–1874 (2008)
H.-H. Franco-Penya, L. Mamani Sanchez, Tuning Bayes baseline for dialect detection, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 227–234. https://aclanthology.org/W16-4829
M. Franco-Salvador, N. Plotnikova, N. Pawar, Y. Benajiba, Subword-based deep averaging networks for author profiling – notebook for PAN at CLEF 2017, in Working Notes Papers of CLEF 2017 Evaluation Labs and Workshop, ed. by L. Cappellato, N. Ferro, L. Goeuriot, T. Mandl, Dublin, Ireland (2017). CEUR-WS.org. http://ceur-ws.org/Vol-1866/
M. Franco-Salvador, P. Rosso, F. Rangel, Distributed representations of words and documents for discriminating similar languages, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria (Association for Computational Linguistics, 2015), pp. 11–16. https://aclanthology.org/W15-5403
F. Gaim, W. Yang, J.C. Park, GeezSwitch: language identification in typologically related low-resourced East African languages, in Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France (European Language Resources Association, 2022), pp. 6578–6584. https://aclanthology.org/2022.lrec-1.707
P. Gamallo, J.R. Pichel, I. Alegria, A perplexity-based method for similar languages discrimination, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain (Association for Computational Linguistics, 2017), pp. 109–114. https://doi.org/10.18653/v1/W17-1213. https://aclanthology.org/W17-1213
S. Gella, K. Bali, M. Choudhury, “ye word kis lang ka hai bhai?” Testing the limits of word level language identification, in Proceedings of ICON-2014, the 11th International Conference on Natural Language Processing, Goa, India (2014)
M. Gemeda Yigezu, A. Lambebo Tonja, O. Kolesnikova, M. Shahiki Tash, G. Sidorov, A. Gelbukh, Word level language identification in code-mixed Kannada-English texts using deep learning approach, in Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, IIIT Delhi, New Delhi, India (Association for Computational Linguistics, 2022), pp. 29–33. https://aclanthology.org/2022.icon-wlli.6
E. Giguet, Categorization according to language: a step toward combining linguistic knowledge and statistic learning, in Proceedings of the International Workshop on Parsing Technologies (IWPT’95), Prague - Karlovy Vary, Czech Republic (1995)
N. Gillin, Is encoder-decoder transformer the shiny hammer? in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022), pp. 80–85. https://aclanthology.org/2022.vardial-1.9
O. Giwa, Language Identification for Proper Name Pronunciation. Ph.D. thesis, North-West University, Vaal Triangle (2016)
O. Giwa, M.H. Davel, Language identification of individual words with joint sequence models, in Proceedings of Interspeech 2014, Singapore (2014)
C. Goutte, S. Léger, Experiments in discriminating similar languages, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria (Association for Computational Linguistics, 2015), pp. 78–84. https://aclanthology.org/W15-5413
C. Goutte, S. Léger, M. Carpuat, The NRC system for discriminating similar languages, in Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, Dublin, Ireland (Association for Computational Linguistics and Dublin City University, 2014), pp. 139–145. https://doi.org/10.3115/v1/W14-5316. https://aclanthology.org/W14-5316
G. Grefenstette, Comparing two language identification schemes, in Proceedings of the 3rd International conference on Statistical Analysis of Textual Data (JADT 1995), Rome, Italy (1995)
S. Gundapu, R. Mamidi, Word level language identification in English Telugu code mixed data, in Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation, Hong Kong (Association for Computational Linguistics, 2018). Accessed from 1–3 Dec. 2018. https://www.aclweb.org/anthology/Y18-1021
D.K. Gupta, S. Kumar, A. Ekbal, Machine learning approach for language identification & transliteration: shared task report of IITP-TS, in Forum for Information Retrieval Evaluation (FIRE) (Bangalore, India, 2014), pp. 60–64
R. Haas, L. Derczynski, Discriminating between similar Nordic languages, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021), pp. 67–75. https://aclanthology.org/2021.vardial-1.8
H. Haddad, A.C. Rouhou, A. Messaoudi, A. Korched, C. Fourati, A. Sellami, M. Ben HajHmida, F. Ghriss, TunBERT: pretraining BERT for Tunisian dialect understanding. SN Comput. Sci. 4(2), 194 (2023)
J. Häkkinen, J. Tian, N-gram and decision tree based language identification for written words, in Conference Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2001), Madonna di Campiglio, Italy (2001), pp. 335–338
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update. ACM SIGKDD Explorations Newslett. 11(1), 10–18 (2009)
A. Hamzah, Deteksi bahasa untuk dokumen teks berbahasa Indonesia, in Seminar Nasional Informatika (semnasIF 2010), Jakarta, Indonesia (2010), pp. A5–A13
A. Hanani, A. Qaroush, S. Taylor, Classifying ASR transcriptions according to Arabic dialect, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 126–134. https://aclanthology.org/W16-4817
H. Hassanpour, M. AlyanNezhadi, M. Mohammadi, A signal processing method for text language identification. Int. J. Eng. 34(6), 1413–1418 (2021)
J. He, Z. Zhang, X. Zhao, P. Li, Y. Yan, Similar language identification for Uyghur and Kazakh on short spoken texts, in Proceedings of the 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC 2016), vol. 2, Hangzhou, China (2016), pp. 496–499
R. Hecht-Nielsen, Theory of the backpropagation neural network, in Proceedings of the International Joint Conference on Neural Networks (IJCNN 1989), Washington, DC, USA (1989), pp. I593–I605. https://doi.org/10.1109/IJCNN.1989.118638
P. Henrich, Language identification for the automatic grapheme-to-phoneme conversion of foreign words in a German text-to-speech system, in First European Conference on Speech Communication and Technology, Paris, France (1989), pp. 2220–2223
A.F. Hidayatullah, A. Qazi, D.T.C. Lai, R.A. Apong, A systematic review on language identification of code-mixed text: techniques, data availability, challenges, and framework development. IEEE Access 10, 122812–122831 (2022). https://doi.org/10.1109/ACCESS.2022.3223703
S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
D.W. Hosmer, S. Lemeshow, R.X. Sturdivant, Applied logistic regression. Wiley Series in Probability and Statistics, 3rd edn. (Wiley, Hoboken, N.J., USA, 2013)
A.S. House, E.P. Neuburg, Toward automatic identification of the language of an utterance. I. Preliminary methodological considerations. J. Acoust. Soc. Am. 62(3), 708–713 (1977)
A. Hussain, M.U. Arshad, An Attention Based Neural Network for Code Switching Detection: English & Roman Urdu (2021). arXiv:2103.02252
D.-M. Iliescu, R. Grand, S. Qirko, R. van der Goot, Much gracias: semi-supervised code-switch detection for Spanish-English: how far can we get? in Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching (Association for Computational Linguistics, 2021), pp. 65–71. https://www.aclweb.org/anthology/2021.calcs-1.9
A. Jaech, G. Mulcaire, S. Hathi, M. Ostendorf, N.A. Smith, Hierarchical character-word models for language identification, in Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media, Austin, TX, USA (Association for Computational Linguistics, 2016a), pp. 84–93. https://doi.org/10.18653/v1/W16-6212. https://aclanthology.org/W16-6212
A. Jaech, G. Mulcaire, M. Ostendorf, N.A. Smith, A neural model for language identification in code-switched tweets, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, Austin, Texas (Association for Computational Linguistics, 2016b), pp. 60–64. https://doi.org/10.18653/v1/W16-5807. https://aclanthology.org/W16-5807
R. Jalam, Apprentissage Automatique et Catégorisation de Textes Multilingues. Ph.D. thesis, Université Lumière Lyon 2 (2003)
R. Jalam, O. Teytaud, Kernel-based text categorization, in Proceedings of the International Joint Conference on Neural Networks (IJCNN’01), vol. 3, Washington, DC, USA (2001a), pp. 1891–1896
R. Jalam, O. Teytaud, Identification de la Langue et Catégorisation de Textes Basées sur les N-grammes, in Journées Francophones d’extraction et de gestion de connaissances (EGC’2001), ed. by H. Briand, F. Guillet, Nantes, France (2001), pp. 227–238
T. Jauhiainen, Tekstin kielen automaattinen tunnistaminen. Master’s thesis, University of Helsinki, Helsinki (2010)
T. Jauhiainen, H. Jauhiainen, K. Lindén, Optimizing naive Bayes for Arabic dialect identification, in Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates (Hybrid) (Association for Computational Linguistics, 2022c), pp. 409–414. https://aclanthology.org/2022.wanlp-1.40
T. Jauhiainen, H. Jauhiainen, K. Lindén, Discriminating similar languages with token-based backoff, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria (Association for Computational Linguistics, 2015b), pp. 44–51. https://www.aclweb.org/anthology/W15-5408
T. Jauhiainen, H. Jauhiainen, K. Lindén, Experiments in language variety geolocation and dialect identification, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain (Online) (International Committee on Computational Linguistics (ICCL), 2020b), pp. 220–231. https://www.aclweb.org/anthology/2020.vardial-1.21
T. Jauhiainen, H. Jauhiainen, K. Lindén, HeLI-based experiments in Swiss German dialect identification, in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, New Mexico, USA (Association for Computational Linguistics, 2018b), pp. 254–262. https://www.aclweb.org/anthology/W18-3929
T. Jauhiainen, H. Jauhiainen, K. Lindén, HeLI-OTS, off-the-shelf language identifier for text, in Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France (European Language Resources Association, 2022a), pp. 3912–3922. https://aclanthology.org/2022.lrec-1.416
T. Jauhiainen, H. Jauhiainen, K. Lindén, Italian language and dialect identification and regional French variety detection using adaptive naive Bayes, in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022b), pp. 119–129. https://aclanthology.org/2022.vardial-1.13
T. Jauhiainen, H. Jauhiainen, K. Lindén, Iterative language model adaptation for Indo-Aryan language identification, in Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, New Mexico, USA (Association for Computational Linguistics, 2018a), pp. 66–75. https://www.aclweb.org/anthology/W18-3907
T. Jauhiainen, H. Jauhiainen, K. Lindén, Naive Bayes-based experiments in Romanian dialect identification, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021a), pp. 76–83. https://www.aclweb.org/anthology/2021.vardial-1.9
T. Jauhiainen, H. Jauhiainen, N. Partanen, K. Lindén, Uralic language identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora, in Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain (Online) (International Committee on Computational Linguistics (ICCL), 2020c), pp. 173–185. https://www.aclweb.org/anthology/2020.vardial-1.16
T. Jauhiainen, K. Lindén, H. Jauhiainen, Evaluating HeLI with non-linear mappings, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain (Association for Computational Linguistics, 2017b), pp. 102–108. https://doi.org/10.18653/v1/W17-1212. https://www.aclweb.org/anthology/W17-1212
T. Jauhiainen, K. Lindén, H. Jauhiainen, Evaluation of language identification methods using 285 languages, in Proceedings of the 21st Nordic Conference on Computational Linguistics, Gothenburg, Sweden (Association for Computational Linguistics, 2017a), pp. 183–191. https://www.aclweb.org/anthology/W17-0221
T. Jauhiainen, K. Lindén, H. Jauhiainen, HeLI, a word-based backoff method for language identification, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 153–162. https://www.aclweb.org/anthology/W16-4820
T. Jauhiainen, K. Lindén, H. Jauhiainen, Language set identification in noisy synthetic multilingual documents, in Proceedings of the Computational Linguistics and Intelligent Text Processing 16th International Conference (CICLing 2015), Cairo, Egypt (2015c), pp. 633–643
T. Jauhiainen, M. Lui, M. Zampieri, T. Baldwin, K. Lindén, Automatic language identification in texts: a survey. J. Artif. Intell. Res. 65, 675–782 (2019e). ISSN 1076-9757. https://doi.org/10.1613/jair.1.11675
T. Jauhiainen, T. Ranasinghe, M. Zampieri, Comparing approaches to Dravidian language identification, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021c), pp. 120–127. https://www.aclweb.org/anthology/2021.vardial-1.14
T. Jo, Neural text categorizer for exclusive text categorization. J. Inf. Process. Syst. 4, 77–86 (2008)
M.I. Jordan, Serial Order: A Parallel Distributed Processing Approach. Technical report, Institute for Cognitive Science, University of California, San Diego (1986)
A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain (Association for Computational Linguistics, 2017), pp. 427–431. https://www.aclweb.org/anthology/E17-2068
D. Jurgens, Y. Tsvetkov, D. Jurafsky, Incorporating dialectal variability for socially equitable language identification, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada (Association for Computational Linguistics, 2017), pp. 51–57. https://doi.org/10.18653/v1/P17-2009. https://aclanthology.org/P17-2009
S. Kadri, A. Moussaoui, An effective method to recognize the language of a text in a collection of multilingual documents, in Proceedings of the International Conference on Electronics, Computer and Computation (ICECCO 2013), Ankara, Turkey (2013), pp. 208–211
N.J. Kalita, A.G. Agarwala, J. Das, Word level language identification on code-mixed English-Bodo text, in IOP Conference Series: Materials Science and Engineering, vol. 1020 (IOP Publishing, 2021), p. 012027
C.M. Kastner, G.A. Covington, A.A. Levine, J.W. Lockwood, HAIL: a hardware-accelerated algorithm for language identification, in Proceedings of the 2005 International Conference on Field Programmable Logic and Applications (FPL), ed. by T. Rissa, S. Wilton, P. Leong, Tampere, Finland (2005), pp. 499–504
T. Kerwin, Classification of Natural Language Based on Character Frequency (Ohio Supercomputer Center, 2006)
E. Ketzan, N. Werner, ‘Entrez!’ she called: evaluating language identification tools in English literary texts, in Proceedings of the Computational Humanities Research Conference 2022 (CHR 2022), Antwerp, Belgium (2022), pp. 366–373
L. Kevers, CoSwID, a code switching identification method suitable for under-resourced languages, in Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, Marseille, France (European Language Resources Association, 2022), pp. 112–121. https://aclanthology.org/2022.sigul-1.15
S. Kim, J. Park, Automatic Detection of Character Encoding and Language. Technical report, Stanford University (2007)
B.P. King, Practical Natural Language Processing for Low-Resource Languages. Ph.D. thesis, University of Michigan (2015)
L. King, S. Kübler, W. Hooper, Word-level language identification in The Chymistry of Isaac Newton. Digit. Scholarsh. Humanit. 30(4), 532–540 (2015)
T. Kocmi, O. Bojar, LanideNN: Multilingual language identification on character window, in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain (Association for Computational Linguistics, 2017), pp. 927–936. https://aclanthology.org/E17-1087
D. Kosmajac, Author and Language Profiling of Short Texts. Ph.D. thesis, Dalhousie University (2020)
C. Kruengkrai, P. Srichaivattana, V. Sornlertlamvanich, H. Isahara, Language identification based on string kernels, in Proceedings of the 5th International Symposium on Communications and Information Technologies (ISCIT-2005), vol. 2, Beijing, China (2005), pp. 896–899
S. Kulikowski, Language Identification of Short Texts (The University of West Florida, Newsgroup article, Educational Research and Development Center, 1991)
A. Lambebo Tonja, M. Gemeda Yigezu, O. Kolesnikova, M. Shahiki Tash, G. Sidorov, A. Gelbukh, Transformer-based model for word level language identification in code-mixed Kannada-English texts (2022). arXiv:2211.14459
S. Leidig, Single and Combined Features for the Detection of Anglicisms in German and Afrikaans. Bachelor’s Thesis, Karlsruhe Institute of Technology (2014)
Y. Li, T. Baldwin, T. Cohn, What’s in a domain? Learning domain-robust text representations using adversarial training, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana (Association for Computational Linguistics, 2018), pp. 474–479. https://doi.org/10.18653/v1/N18-2076. https://aclanthology.org/N18-2076
N. Ljubešić, D. Kranjčić, Discriminating between VERY similar languages among Twitter users, in Proceedings of the 9th Language Technologies Conference, Ljubljana, Slovenia (2014), pp. 90–94
A.F. Llitjós, Improving Pronunciation Accuracy of Proper Names with Language Origin Classes. Master’s thesis, Carnegie Mellon University, Pittsburgh, PA, USA (2001)
M. Lui, T. Baldwin, Accurate language identification of Twitter messages, in Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), Gothenburg, Sweden (Association for Computational Linguistics, 2014), pp. 17–25. https://doi.org/10.3115/v1/W14-1303. https://aclanthology.org/W14-1303
M. Lui, T. Baldwin, langid.py: an off-the-shelf language identification tool, in Proceedings of the ACL 2012 System Demonstrations, Jeju Island, Korea (Association for Computational Linguistics, 2012), pp. 25–30. https://aclanthology.org/P12-3005
M. Lui, J.H. Lau, T. Baldwin, Automatic detection and language identification of multilingual documents. Trans. Assoc. Comput. Linguist. 2, 27–40 (2014)
S. MacNamara, P. Cunningham, J. Byrne, Neural networks for language identification: a comparative study. Inf. Process. Manag. 34(4), 395–403 (1998)
M. Majliš, Large Multilingual Corpus. Master’s thesis, Charles University in Prague, Prague (2011)
M. Majliš, Yet another language identifier, in Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France (Association for Computational Linguistics, 2012), pp. 46–54. https://aclanthology.org/E12-3006
M. Majliš, Z. Žabokrtský, Language richness of the web, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey (European Language Resources Association (ELRA), 2012), pp. 2927–2934. http://www.lrec-conf.org/proceedings/lrec2012/pdf/267_Paper.pdf
S. Malmasi, M. Dras, Automatic language identification for Persian and Dari texts, in Proceedings of the 14th Conference of the Pacific Association for Computational Linguistics, PACLING’15, Bali, Indonesia (2015a), pp. 59–64
S. Malmasi, M. Dras, Feature hashing for language and dialect identification, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada (Association for Computational Linguistics, 2017), pp. 399–403. https://doi.org/10.18653/v1/P17-2063. https://aclanthology.org/P17-2063
S. Malmasi, M. Dras, Language identification using classifier ensembles, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria (Association for Computational Linguistics, 2015b), pp. 35–43. https://aclanthology.org/W15-5407
S. Malmasi, M. Zampieri, Arabic dialect identification in speech transcripts, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 106–113. https://aclanthology.org/W16-4814
S. Malmasi, M. Zampieri, Arabic dialect identification using iVectors and ASR transcripts, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain (Association for Computational Linguistics, 2017b), pp. 178–183. https://doi.org/10.18653/v1/W17-1222. https://aclanthology.org/W17-1222
S. Malmasi, M. Zampieri, German dialect identification in interview transcriptions, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain (Association for Computational Linguistics, 2017a), pp. 164–169. https://doi.org/10.18653/v1/W17-1220, https://aclanthology.org/W17-1220
S. Malmasi, M. Zampieri, N. Ljubešić, P. Nakov, A. Ali, J. Tiedemann, Discriminating between similar languages and Arabic dialect identification: a report on the third DSL shared task, in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan (The COLING 2016 Organizing Committee, 2016), pp. 1–14. https://www.aclweb.org/anthology/W16-4801
P. Martadinata, B.D. Trisedya, H.M. Manurung, M. Adriani, Building Indonesian local language detection tools using Wikipedia data, in Worldwide Language Service Infrastructure, ed. by Y. Murakami, D. Lin (Springer, 2016), pp. 113–123
M. Martinc, I. Škrjanec, K. Zupan, S. Pollak, PAN, author profiling - gender and language variety prediction-notebook for PAN at CLEF, in 2017 Working Notes Papers of CLEF 2017 Evaluation Labs and Workshop, Dublin, Ireland (2017)
P. Mathur, A. Misra, E. Budur, LIDE: Language Identification from Text Documents (2017). arXiv:1701.03682
A.K. McCallum, MALLET: a machine learning for language toolkit (2002). http://mallet.cs.umass.edu
P. McNamee, Language identification: a solved problem suitable for undergraduate instruction. J. Comput. Sci. Coll. 20(3), 94–101 (2005)
M. Medvedeva, M. Kroon, B. Plank, When sparse traditional models outperform dense neural networks: the curious case of discriminating between similar languages, in Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain (Association for Computational Linguistics, 2017), pp. 156–163. https://doi.org/10.18653/v1/W17-1219, https://aclanthology.org/W17-1219
S. Mehta, T. Jain, N. Aggarwal, Multilingual short text analysis of Twitter using random forest approach, in Knowledge Graphs and Semantic Web, ed. by B. Villazón-Terrazas, F. Ortiz-Rodríguez, S. Tiwari, A. Goyal, M. Jabbar (Springer International Publishing, Cham, 2021), pp. 84–92. ISBN 978-3-030-91305-2
I. Mendoza, J. Mendelsohn, Exploring Techniques in Distinguishing Similar Languages. Stanford course project (2017)
A. Miletic, Y. Scherrer, OcWikiDisc: a corpus of Wikipedia talk pages in Occitan, in Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, Gyeongju, Republic of Korea (Association for Computational Linguistics, 2022), pp. 70–79. https://aclanthology.org/2022.vardial-1.8
A. Moodley, Language Identification With Decision Trees: Identification Of Individual Words In The South African Languages. Bachelor’s Thesis, University of South Africa (2016)
A. Mukherjee, A. Ravi, K. Datta, Mixed-script query labelling using supervised learning and ad hoc retrieval using sub word indexing, in FIRE ’14 Proceedings of the Forum for Information Retrieval, Bangalore, India (2014), pp. 86–90
S. Mustonen, Multiple discriminant analysis in linguistic problems. Stat. Methods Linguist. 4, 37–44 (1965)
H. Ney, U. Essen, R. Kneser, On structuring probabilistic dependences in stochastic language modelling. Comput. Speech Lang. 8(1), 1–38 (1994)
A.Y. Ng, M.I. Jordan, On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes, in Advances in Neural Information Processing Systems 15 (NIPS 2002), ed. by S. Becker, S. Thrun, K. Obermayer (Vancouver, British Columbia, Canada, 2002), pp. 841–848
NLLB Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G.M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K.R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N.F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, J. Wang, No language left behind: Scaling human-centered machine translation (2022). https://arxiv.org/abs/2207.04672
B. Okgetheng, E.A. Budu, Word-based Bantu language identification using naïve Bayes, in 2022 IST-Africa Conference (IST-Africa) (2022), pp. 1–7. https://doi.org/10.23919/IST-Africa56635.2022.9845618
G.H. Paetzold, M. Zampieri, Experiments in cuneiform language identification, in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, Ann Arbor, Michigan (Association for Computational Linguistics, 2019), pp. 209–213. https://doi.org/10.18653/v1/W19-1423. https://www.aclweb.org/anthology/W19-1423
S. Patel, V. Desai, LIGA and syllabification approach for language identification and back transliteration: a shared task report by DA-IICT, in FIRE ’14 Proceedings of the Forum for Information Retrieval Evaluation, Bangalore, India (2014), pp. 43–47
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(Oct), 2825–2830 (2011)
F. Peng, D. Schuurmans, Combining naive Bayes and n-gram language models for text classification, in Proceedings of the 25th European Conference on IR Research, Advances in Information Retrieval (ECIR 2003), Pisa, Italy (Springer, Berlin, Heidelberg, 2003), pp. 335–350. ISBN 978-3-540-36618-8. https://doi.org/10.1007/3-540-36618-0_24
J. Porta, J.-L. Sancho, Using maximum entropy models to discriminate between similar languages and varieties, in Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, Dublin, Ireland (Association for Computational Linguistics and Dublin City University, 2014), pp. 120–128. https://doi.org/10.3115/v1/W14-5314. https://aclanthology.org/W14-5314
A. Poutsma, Applying Monte Carlo techniques to language identification. Lang. Comput. 45(1), 179–189 (2002)
J.M. Prager, Linguini: language identification for multilingual documents, in Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences (HICSS-32), Maui, USA (1999)
Y. Qu, G. Grefenstette, Finding ideographic representations of Japanese names written in Latin script via language identification and corpus validation, in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain (2004), pp. 183–190. https://doi.org/10.3115/1218955.1218979, https://aclanthology.org/P04-1024
F. Rangel, P. Rosso, M. Potthast, B. Stein, Overview of the 5th Author Profiling Task at PAN 2017: gender and language variety identification in Twitter, in Working Notes Papers of CLEF 2017 Evaluation Labs and Workshop, ed. by L. Cappellato, N. Ferro, L. Goeuriot, T. Mandl, Dublin, Ireland (2017). CEUR-WS.org. http://ceur-ws.org/Vol-1866/
M.D. Rau, Language Identification by Statistical Analysis. Master’s thesis, Naval Postgraduate School, Monterey (1974)
C. Sabty, I. Mesabah, Ö. Çetinoğlu, S. Abdennadher, Language identification of intra-word code-switching for Arabic-English. Array 12, 100104 (2021). ISSN 2590-0056. https://doi.org/10.1016/j.array.2021.100104. https://www.sciencedirect.com/science/article/pii/S2590005621000473
Y. Samih, S. Maharjan, M. Attia, L. Kallmeyer, T. Solorio, Multilingual code-switching identification via LSTM recurrent neural networks, in Proceedings of the Second Workshop on Computational Approaches to Code Switching, Austin, Texas (Association for Computational Linguistics, 2016), pp. 50–59. https://doi.org/10.18653/v1/W16-5806. https://aclanthology.org/W16-5806
Y. Samih, W. Maier, Detecting code-switching in Moroccan Arabic social media, in Proceedings of the 4th International Workshop on Natural Language Processing for Social Media (SocialNLP 2016 IJCAI), New York City, USA (2016)
N. Sarma, Automatic Language Identification in Online Multilingual Conversations. Ph.D. thesis, Indian Institute of Technology Guwahati (2021)
N. Sarma, R. Sanasam Singh, D. Goswami, SwitchNet: learning to switch for word-level language identification in code-mixed social media text. Nat. Lang. Eng. 28(3), 337–359 (2022). https://doi.org/10.1017/S1351324921000115
F. Sebastiani, Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 34(1), 1–47 (2002)
A. Selamat, C.-C. Ng, Y. Mikami, Arabic script web documents language identification using decision tree-ARTMAP model, in Proceedings of the International Conference on Convergence Information Technology (ICCIT 2007), Gyeongju, Korea (IEEE, 2007), pp. 717–722
R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2016), pp. 1715–1725
K. Shaffer, Language clustering for multilingual named entity recognition, in Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic (Association for Computational Linguistics, 2021), pp. 40–45. https://aclanthology.org/2021.findings-emnlp.4
P. Shrestha, Incremental N-gram approach for language identification in code-switched text, in Proceedings of the First Workshop on Computational Approaches to Code Switching, Doha, Qatar (Association for Computational Linguistics, 2014), pp. 133–138. https://doi.org/10.3115/v1/W14-3916, https://aclanthology.org/W14-3916
P. Sibun, J.C. Reynar, Language identification: examining the issues, in Proceedings of the 5th Annual Symposium on Document Analysis and Information Retrieval (SDAIR-96), Las Vegas, USA (1996), pp. 125–135
A.K. Singh, Modeling and Application of Linguistic Similarity. Ph.D. thesis, International Institute of Information Technology, Hyderabad (2010)
A.K. Singh, Study of some distance measures for language and encoding identification, in Proceedings of the Workshop on Linguistic Distances, Sydney, Australia (2006), pp. 63–72
X. Song, A. Salcianu, Y. Song, D. Dopson, D. Zhou, Fast wordpiece tokenization, in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (2021), pp. 2089–2103
C. Souter, G. Churcher, J. Hayes, J. Hughes, S. Johnson, Natural language identification using corpus-based models. Hermes, J. Linguist. 13, 183–203 (1994)
A. Stensby, B.J. Oommen, O.-C. Granmo, Language detection and tracking in multilingual documents using weak estimators, in Proceedings of the Joint IAPR International Workshop Structural, Syntactic, and Statistical Pattern Recognition (SSPR & SPR 2010), Cesme, Izmir, Turkey (2010), pp. 600–609
A. Stolcke, SRILM - an extensible language modeling toolkit, in Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP-2002), Denver, Colorado, USA (2002), pp. 901–904
H. Takçı, T. Güngör, A high performance centroid-based classification approach for language identification. Pattern Recognit. Lett. 33(16), 2077–2084 (2012)
L. Tan, M. Zampieri, N. Ljubešić, J. Tiedemann, Merging comparable data sources for the discrimination of similar languages: the DSL corpus collection, in Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC), Reykjavik, Iceland (2014)
W.J. Teahan, Text classification and segmentation using minimum cross-entropy, in Proceedings of the 6th International Conference Recherche d’Information Assistee par Ordinateur (RIAO’00), Paris, France (2000), pp. 943–961
S. Thara, P. Poornachandran, Transformer based language identification for Malayalam-English code-mixed text. IEEE Access 9, 118837–118850 (2021). https://doi.org/10.1109/ACCESS.2021.3104106
J. Tian, J. Häkkinen, S. Riis, K.J. Jensen, On text-based language identification for multilingual speech recognition systems, in Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP2002), Denver, Colorado, USA (2002)
J. Tian, J. Suontausta, Scalable neural network based language identification from written text, in Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), vol. 1, Hong Kong (2003), pp. 48–51
M. Toftrup, S. Asger Sørensen, M.R. Ciosici, I. Assent, A reproduction of Apple’s bi-directional LSTM models for language identification in short strings, in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop (Association for Computational Linguistics, 2021), pp. 36–42. https://www.aclweb.org/anthology/2021.eacl-srw.6
E. Tromp, Multilingual Sentiment Analysis on Social Media. Master’s thesis, Eindhoven University of Technology, Eindhoven (2011)
E. Tromp, M. Pechenizkiy, Graph-based N-gram language identification on short texts, in Proceedings of the 20th Annual Belgian Dutch Conference on Machine Learning (Benelearn 2011), The Hague, Netherlands (2011), pp. 27–34
D. Tudoreanu, DTeam @ VarDial 2019: Ensemble based on skip-gram and triplet loss neural networks for Moldavian vs. Romanian cross-dialect topic identification, in Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, Ann Arbor, Michigan (Association for Computational Linguistics, 2019), pp. 202–208. https://doi.org/10.18653/v1/W19-1422. https://aclanthology.org/W19-1422
P. v. Cann, Dialect Identification on Twitter: A Research About the Detection of the Limburgian Dialect from Twitter messages. Master’s thesis, University of Tilburg (2015)
T. Vatanen, J.J. Väyrynen, S. Virpioja, Language identification of short text segments with n-gram models, in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta (European Language Resources Association (ELRA), 2010). http://www.lrec-conf.org/proceedings/lrec2010/pdf/279_Paper.pdf
J. Vogel, D. Tresner-Kirsch, Robust language identification in short, noisy texts: improvements to LIGA, in Proceedings of the 3rd International Workshop on Mining Ubiquitous and Social Environments (MUSE), ed. by M. Atzmueller, H. Andreas, Bristol, UK (2012), pp. 43–50
A. Wadhawan, Dialect identification in nuanced Arabic tweets using Farasa segmentation and AraBERT, in Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine (Virtual) (Association for Computational Linguistics, 2021), pp. 291–295. https://aclanthology.org/2021.wanlp-1.35
M. Watson, C. Qian, J. Bischof, F. Chollet, et al., KerasNLP (2022). https://github.com/keras-team/keras-nlp
I. Weber, Language identification in a highly unbalanced dataset. Master’s thesis, Stellenbosch University (2022)
D. Widdows, C. Brew, Language identification with a reciprocal rank classifier (2021). arXiv:2109.09862
I.H. Witten, E. Frank, M.A. Hall, C.J. Pal, Data Mining, Fourth Edition: Practical Machine Learning Tools and Techniques, 4th edn. (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2016). ISBN 0128042915
D.H. Wolpert, Stacked generalization. Neural Netw. 5(2), 241–259 (1992)
M. Yasir, L. Chen, A. Khatoon, M.A. Malik, F. Abid, Mixed script identification using automated DNN hyperparameter optimization. Comput. Intell. Neurosci. (2021)
J.-L. You, Y.-N. Chen, M. Chu, F.K. Soong, J.-L. Wang, Identifying language origin of named entity with multiple information sources. IEEE Trans. Audio Speech Lang. Process. 16(6), 1077–1086 (2008)
G.-E. Zaharia, A.-M. Avram, D.-C. Cercel, T. Rebedea, Dialect identification through adversarial learning and knowledge distillation on Romanian BERT, in Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, Kyiv, Ukraine (Association for Computational Linguistics, 2021), pp. 113–119. https://aclanthology.org/2021.vardial-1.13
J.D. Zamora, A.F. Bruzòn, R.O. Bueno, Tweets language identification using feature weighting, in Proceedings of the Tweet Language Identification Workshop 2014 co-located with 30th Conference of the Spanish Society for Natural Language Processing (SEPLN 2014) Girona, Spain (2014), pp. 30–34
M. Zampieri, B. G. Gebre, H. Costa, J. van Genabith, Comparing approaches to the identification of similar languages, in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria (Association for Computational Linguistics, 2015a), pp. 66–72. https://aclanthology.org/W15-5411
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Jauhiainen, T., Zampieri, M., Baldwin, T., Lindén, K. (2024). Features and Methods. In: Automatic Language Identification in Texts. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-45822-4_2
DOI: https://doi.org/10.1007/978-3-031-45822-4_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45821-7
Online ISBN: 978-3-031-45822-4
eBook Packages: Synthesis Collection of Technology (R0)
