Feature Analysis for Native Language Identification

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9041)

Abstract

In this study, we investigate the contribution of different feature types to the task of native language identification. For this purpose, we compile a learner corpus from a subset of the EF Cambridge Open Language Database (EFCAMDAT) [10], developed at the University of Cambridge in collaboration with EF Education. The features we consider are character n-grams, positional token frequencies, part-of-speech n-grams, function words, shell nouns, and a set of annotated errors. Finally, we examine whether essays by English learners who share the same mother tongue can be distinguished by the learners' country of origin.
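
As a rough illustration of two of the feature types named in the abstract, the sketch below counts character n-grams and computes relative function-word frequencies for a single essay. It is a minimal sketch and not the paper's implementation: the function names, the toy sentence, and the short `FUNCTION_WORDS` list are assumptions made for this example, and tokenization is plain whitespace splitting.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list for illustration;
# the list actually used in the paper is not given in the abstract.
FUNCTION_WORDS = ("the", "a", "of", "in", "to", "and", "that", "is")


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def function_word_freqs(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each function word among whitespace tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in vocabulary}


essay = "The cat sat in the hat"  # toy input, not taken from EFCAMDAT
trigrams = char_ngrams(essay, n=3)
freqs = function_word_freqs(essay)
```

Feature dictionaries of this kind would typically be turned into fixed-length vectors before being passed to a linear classifier; that step is omitted here.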

References

  1. Brooke, J., Hirst, G.: Native language detection with ‘cheap’ learner corpora. In: Conference of Learner Corpus Research (LCR 2011). Presses universitaires de Louvain, Louvain-la-Neuve (2011)

  2. Chomsky, N.A.: Linguistics and philosophy. In: Hook, S. (ed.) Language and Philosophy. New York University Press (1969)

  3. Corder, S.P.: Language distance and the magnitude of the language learning task. Studies in Second Language Acquisition 2, 27–36 (1979)

  4. Dinu, A.: On classifying coherent/incoherent Romanian short texts. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008). European Language Resources Association (ELRA), Marrakech (2008)

  5. Dinu, L.P., Niculae, V., Şulea, O.M.: Pastiche detection based on stopword rankings: exposing impersonators of a Romanian writer. In: Proceedings of the Workshop on Computational Approaches to Deception Detection, EACL 2012, pp. 72–77. Association for Computational Linguistics, Stroudsburg (2012)

  6. Dumais, S.: Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers 23(2), 229–236 (1991)

  7. Englishtown: Education first (2012), http://www.englishtown.com/

  8. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)

  9. Finkel, J.R., Grenager, T., Manning, C.D.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: Knight, K., Ng, H.T., Oflazer, K. (eds.) ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, University of Michigan, USA, June 25-30. The Association for Computer Linguistics (2005)

  10. Geertzen, J., Alexopoulou, T., Korhonen, A.: Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In: Proceedings of the 31st Second Language Research Forum (SLRF). Cascadilla Press, MA (2013)

  11. Granger, S., Dagneaux, E., Meunier, F.: The International Corpus of Learner English: Handbook and CD-ROM, version 2. Presses Universitaires de Louvain, Louvain-la-Neuve (2009)

  12. Ionescu, R.T., Popescu, M., Cahill, A.: Can characters reveal your native language? A language-independent approach to native language identification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1363–1373. Association for Computational Linguistics (2014)

  13. Jarvis, S., Bestgen, Y., Pepper, S.: Maximizing classification accuracy in native language identification. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 111–118. Association for Computational Linguistics, Atlanta (2013)

  14. Kolhatkar, V., Zinsmeister, H., Hirst, G.: Interpreting anaphoric shell nouns using antecedents of cataphoric shell nouns as training data. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 300–310. Association for Computational Linguistics (2013)

  15. Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60(1), 9–26 (2009)

  16. Koppel, M., Schler, J., Zigdon, K.: Automatically determining an anonymous author’s native language. In: Kantor, P., Muresan, G., Roberts, F., Zeng, D.D., Wang, F.-Y., Chen, H., Merkle, R.C. (eds.) ISI 2005. LNCS, vol. 3495, pp. 209–217. Springer, Heidelberg (2005)

  17. Koppel, M., Schler, J., Zigdon, K.: Determining an author’s native language by mining a text for errors. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 624–628. ACM, Chicago (2005)

  18. Landauer, T., McNamara, D., Dennis, S., Kintsch, W.: Handbook of Latent Semantic Analysis. Taylor and Francis (2013)

  19. Lim, C., Lee, K., Kim, G.: Multiple sets of features for automatic genre classification of web documents. Information Processing and Management 41(5), 1263–1276 (2005)

  20. Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text classification using string kernels. J. Mach. Learn. Res. 2, 419–444 (2002)

  21. Long, M.H.: Maturational constraints on language development. Studies in Second Language Acquisition 12, 251–285 (1990)

  22. Münte, T.F., Wieringa, B.M., Weyerts, H., Szentkuti, A., Matzke, M., Johannes, S.: Differences in brain potentials to open and closed class words: class and frequency effects. Neuropsychologia 39(1), 91–102 (2001)

  23. Nagata, R., Whittaker, E.W.D.: Reconstructing an Indo-European family tree from non-native English texts. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, Volume 1: Long Papers, Sofia, Bulgaria, August 4-9, pp. 1137–1147 (2013)

  24. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, pp. 142–147 (2003)

  25. Schmid, H.U.: English Abstract Nouns As Conceptual Shells: From Corpus to Cognition. In: Topics in English Linguistics 34. De Gruyter Mouton, Berlin (2000)

  26. Selinker, L., Rutherford, W.: Rediscovering Interlanguage. Applied Linguistics and Language Study. Routledge (2014)

  27. Selinker, L.: Interlanguage. International Review of Applied Linguistics in Language Teaching 10(1-4), 209–232 (1972)

  28. Tarone, E.: Interlanguage. Blackwell Publishing Ltd. (2012)

  29. Tetreault, J., Blanchard, D., Cahill, A.: A report on the first native language identification shared task. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, Atlanta (2013)

  30. Tetreault, J., Blanchard, D., Cahill, A., Chodorow, M.: Native tongues, lost and found: Resources and empirical evaluations in native language identification. In: Proceedings of COLING 2012, pp. 2585–2602. The COLING 2012 Organizing Committee, Mumbai (2012)

  31. Tsur, O., Rappoport, A.: Using Classifier Features for Studying the Effect of Native Language on the Choice of Written Second Language Words. In: Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pp. 9–16. Association for Computational Linguistics, Prague (2007)

  32. Tsvetkov, Y., Twitto, N., Schneider, N., Ordan, N., Faruqui, M., Chahuneau, V., Wintner, S., Dyer, C.: Identifying the L1 of non-native writers: the CMU-Haifa system. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 279–287. Association for Computational Linguistics, Atlanta (2013)

  33. Valette, R.M.: Proficiency and the prevention of fossilization: an editorial. The Modern Language Journal 75(3), 325–328 (1991)

  34. Volansky, V., Ordan, N., Wintner, S.: On the features of translationese. Literary and Linguistic Computing (2013)

  35. Chen, Y.W., Lin, C.J.: Combining SVMs with various feature selection strategies. Feature Extraction 207, 315–324 (2006)

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Nisioi, S. (2015). Feature Analysis for Native Language Identification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2015. Lecture Notes in Computer Science, vol 9041. Springer, Cham. https://doi.org/10.1007/978-3-319-18111-0_49

  • DOI: https://doi.org/10.1007/978-3-319-18111-0_49

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18110-3

  • Online ISBN: 978-3-319-18111-0

  • eBook Packages: Computer Science, Computer Science (R0)
