Portuguese Native Language Identification

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11122)

Abstract

This paper presents the first Native Language Identification (NLI) study for L2 Portuguese. We use a subset of the NLI-PT dataset containing texts written by speakers of five native languages: Chinese, English, German, Italian, and Spanish. We use the linguistic annotations available in NLI-PT to extract a range of (morpho-)syntactic features and apply NLI classification methods to predict the authors' native language. The best result, 54.1% accuracy, was obtained with an ensemble that combines these features.
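The pipeline summarized above (features extracted from annotations, one classifier per feature type, an ensemble over their outputs) can be illustrated with a minimal sketch. The abstract does not specify the feature definitions or the combination rule, so everything here is an assumption for illustration: the tag sequence, the feature-type names, and the use of plurality voting are hypothetical and are not taken from the paper.

```python
from collections import Counter


def ngrams(tags, n):
    """Return all n-grams (as tuples) over a sequence of tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def plurality_vote(predictions):
    """Combine a non-empty list of predicted labels by plurality vote.

    Ties are broken in favour of the label that appears first in the list.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical POS tags for one learner sentence (illustrative only).
pos_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
bigram_counts = Counter(ngrams(pos_tags, 2))

# Hypothetical outputs of three classifiers, each trained on one feature type.
votes = {
    "pos_ngrams": "Spanish",
    "dependencies": "Italian",
    "morphology": "Spanish",
}
predicted_l1 = plurality_vote(list(votes.values()))
```

In practice each vote would come from a trained classifier rather than a fixed label, and other combination rules (for example, averaging class probabilities) are equally plausible.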

Keywords

Native Language Identification, Learner corpus, Portuguese

Notes

Acknowledgements

We would like to thank the anonymous reviewers for the suggestions and constructive feedback provided.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Harvard Medical School, Boston, USA
  2. University of Lisbon, Lisbon, Portugal
  3. University of Wolverhampton, Wolverhampton, UK
