Abstract

In this study, we investigate the role of different features in the task of native language identification. For this purpose, we compile a learner corpus based on a subset of the EF-Cambridge Open Language Database (EFCAMDAT) [10], developed at the University of Cambridge in collaboration with EF Education. The features we consider include character n-grams, positional token frequencies, part-of-speech n-grams, function words, shell nouns, and a set of annotated errors. Finally, we examine whether essays by English learners who share the same mother tongue can be distinguished based on the learners' country of origin.
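
As a rough illustration of two of the surface features listed above, the following minimal sketch counts character n-grams and computes relative function-word frequencies for a single essay. It is not the feature extraction used in the study: the function names, the short function-word list, and the sample sentence are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration; it is not the list used in the study).
FUNCTION_WORDS = ("the", "a", "of", "in", "to", "and")


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def function_word_freqs(text):
    """Relative frequency of each listed function word among whitespace tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in FUNCTION_WORDS}


# Invented sample sentence, not taken from any corpus.
essay = "I am agree with the idea of the course"
trigram_counts = char_ngrams(essay, n=3)
function_word_profile = function_word_freqs(essay)
```

Feature vectors of this kind would typically be passed to a linear classifier; the choice of n, the tokenization, and the word list are all parameters a real experiment would need to fix.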

References

  1. Brooke, J., Hirst, G.: Native language detection with ‘cheap’ learner corpora. In: Conference of Learner Corpus Research (LCR 2011). Presses universitaires de Louvain, Louvain-la-Neuve (2011)
  2. Chomsky, N.A.: Linguistics and philosophy. In: Hook, S. (ed.) Language and Philosophy. New York University Press (1969)
  3. Corder, S.P.: Language distance and the magnitude of the language learning task. Studies in Second Language Acquisition 2, 27–36 (1979)
  4. Dinu, A.: On classifying coherent/incoherent Romanian short texts. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008). European Language Resources Association (ELRA), Marrakech (2008)
  5. Dinu, L.P., Niculae, V., Şulea, O.M.: Pastiche detection based on stopword rankings: exposing impersonators of a Romanian writer. In: Proceedings of the Workshop on Computational Approaches to Deception Detection, EACL 2012, pp. 72–77. Association for Computational Linguistics, Stroudsburg (2012)
  6. Dumais, S.: Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers 23(2), 229–236 (1991)
  7. Englishtown: Education first (2012), http://www.englishtown.com/
  8. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: A library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
  9. Finkel, J.R., Grenager, T., Manning, C.D.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: Knight, K., Ng, H.T., Oflazer, K. (eds.) ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, University of Michigan, USA, June 25-30. The Association for Computational Linguistics (2005)
  10. Geertzen, J., Alexopoulou, T., Korhonen, A.: Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In: Proceedings of the 31st Second Language Research Forum (SLRF). Cascadilla Press, MA (2013)
  11. Granger, S., Dagneaux, E., Meunier, F.: The International Corpus of Learner English: Handbook and CD-ROM, version 2. Presses Universitaires de Louvain, Louvain-la-Neuve (2009)
  12. Ionescu, T.R., Popescu, M., Cahill, A.: Can characters reveal your native language? A language-independent approach to native language identification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1363–1373. Association for Computational Linguistics (2014)
  13. Jarvis, S., Bestgen, Y., Pepper, S.: Maximizing classification accuracy in native language identification. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 111–118. Association for Computational Linguistics, Atlanta (2013)
  14. Kolhatkar, V., Zinsmeister, H., Hirst, G.: Interpreting anaphoric shell nouns using antecedents of cataphoric shell nouns as training data. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 300–310. Association for Computational Linguistics (2013)
  15. Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60(1), 9–26 (2009)
  16. Koppel, M., Schler, J., Zigdon, K.: Automatically determining an anonymous author’s native language. In: Kantor, P., Muresan, G., Roberts, F., Zeng, D.D., Wang, F.-Y., Chen, H., Merkle, R.C. (eds.) ISI 2005. LNCS, vol. 3495, pp. 209–217. Springer, Heidelberg (2005)
  17. Koppel, M., Schler, J., Zigdon, K.: Determining an author’s native language by mining a text for errors. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 624–628. ACM, Chicago (2005)
  18. Landauer, T., McNamara, D., Dennis, S., Kintsch, W.: Handbook of Latent Semantic Analysis. Taylor and Francis (2013)
  19. Lim, C., Lee, K., Kim, G.: Multiple sets of features for automatic genre classification of web documents. Information Processing and Management 41(5), 1263–1276 (2005)
  20. Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text classification using string kernels. J. Mach. Learn. Res. 2, 419–444 (2002)
  21. Long, M.H.: Maturational constraints on language development. Studies in Second Language Acquisition 12, 251–285 (1990)
  22. Münte, T.F., Wieringa, B.M., Weyerts, H., Szentkuti, A., Matzke, M., Johannes, S.: Differences in brain potentials to open and closed class words: class and frequency effects. Neuropsychologia 39(1), 91–102 (2001)
  23. Nagata, R., Whittaker, E.W.D.: Reconstructing an Indo-European family tree from non-native English texts. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, Volume 1: Long Papers, Sofia, Bulgaria, August 4-9, pp. 1137–1147 (2013)
  24. Sang, E.F.T.K., Meulder, F.D.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, pp. 142–147 (2003)
  25. Schmid, H.U.: English Abstract Nouns As Conceptual Shells: From Corpus to Cognition. In: Topics in English Linguistics 34. De Gruyter Mouton, Berlin (2000)
  26. Selinker, L., Rutherford, W.: Rediscovering Interlanguage. Applied Linguistics and Language Study. Routledge (2014)
  27. Selinker, L.: Interlanguage. International Review of Applied Linguistics in Language Teaching 10(1-4), 209–232 (1972)
  28. Tarone, E.: Interlanguage. Blackwell Publishing Ltd. (2012)
  29. Tetreault, J., Blanchard, D., Cahill, A.: A report on the first native language identification shared task. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Atlanta (2013)
  30. Tetreault, J., Blanchard, D., Cahill, A., Chodorow, M.: Native tongues, lost and found: Resources and empirical evaluations in native language identification. In: Proceedings of COLING 2012, pp. 2585–2602. The COLING 2012 Organizing Committee, Mumbai (2012)
  31. Tsur, O., Rappoport, A.: Using Classifier Features for Studying the Effect of Native Language on the Choice of Written Second Language Words. In: Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pp. 9–16. Association for Computational Linguistics, Prague (2007)
  32. Tsvetkov, Y., Twitto, N., Schneider, N., Ordan, N., Faruqui, M., Chahuneau, V., Wintner, S., Dyer, C.: Identifying the L1 of non-native writers: the CMU-Haifa system. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 279–287. Association for Computational Linguistics, Atlanta (2013)
  33. Valette, R.M.: Proficiency and the prevention of fossilization an editorial. The Modern Language Journal 75(3), 325–328 (1991)
  34. Volansky, V., Ordan, N., Wintner, S.: On the features of translationese. Literary and Linguistic Computing (2013)
  35. Yi-Wei, C., Chih-Jen, L.: Combining SVMs with various feature selection strategies. Feature Extraction 207, 315–324 (2006)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Sergiu Nisioi (1, 2)
  1. Center for Computational Linguistics, University of Bucharest, Bucharest, Romania
  2. Oracle RightNow, Bucharest, Romania
