Applications and Related Tasks

Automatic Language Identification in Texts

Abstract

In the first section of this chapter, we showcase some of the applications that have traditionally incorporated language identification. In effect, this covers every “mixed monolingual” NLP task, where language identification routes each instance to the monolingual model appropriate to its source language.
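
The routing pattern described in the abstract can be sketched as follows. This is a minimal illustration, not the chapter's method: the stopword-overlap identifier, the `STOPWORDS` lists, the `MODELS` table, and the function names `identify_language` and `route` are all hypothetical stand-ins, and a real pipeline would use a trained language identifier and real monolingual models.

```python
from typing import Callable, Dict, Set

# Toy stopword lists (an assumption for illustration only); a real system
# would replace this with a trained language identifier.
STOPWORDS: Dict[str, Set[str]] = {
    "en": {"the", "and", "is", "of", "to"},
    "fi": {"ja", "on", "ei", "että", "se"},
    "de": {"der", "und", "ist", "das", "nicht"},
}


def identify_language(text: str) -> str:
    """Return the language whose stopwords match the most tokens.

    Returns "und" (undetermined) when no stopword matches at all.
    """
    tokens = text.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "und"


# Hypothetical monolingual "models": each one just tags its input so the
# routing decision is visible in the output.
MODELS: Dict[str, Callable[[str], str]] = {
    "en": lambda text: f"[en-model] {text}",
    "fi": lambda text: f"[fi-model] {text}",
    "de": lambda text: f"[de-model] {text}",
}


def route(text: str) -> str:
    """Send the text to the monolingual model for its identified language."""
    lang = identify_language(text)
    model = MODELS.get(lang)
    if model is None:
        # No model for this language: flag the instance instead of guessing.
        return f"[unrouted:{lang}] {text}"
    return model(text)
```

Calling `route` on a mixed stream sends each instance to one model only, and input with no recognisable language is flagged rather than forced through an arbitrary model. How such fallbacks should be handled in practice depends on the application.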

Author information

Corresponding author

Correspondence to Tommi Jauhiainen.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Jauhiainen, T., Zampieri, M., Baldwin, T., Lindén, K. (2024). Applications and Related Tasks. In: Automatic Language Identification in Texts. Synthesis Lectures on Human Language Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-45822-4_6

  • DOI: https://doi.org/10.1007/978-3-031-45822-4_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45821-7

  • Online ISBN: 978-3-031-45822-4

  • eBook Packages: Synthesis Collection of Technology (R0)
