Towards advanced collocation error correction in Spanish learner corpora

  • SI: Resources for language learning
Language Resources and Evaluation

Abstract

Collocations, in the sense of idiosyncratic binary lexical co-occurrences, are one of the biggest challenges for any language learner. Even advanced learners make collocation mistakes: they literally translate collocation elements from their native tongue, create new words as collocation elements, choose a wrong subcategorization for one of the elements, etc. Therefore, automatic collocation error detection and correction is increasingly in demand. However, while state-of-the-art models predict with reasonable accuracy whether a given co-occurrence is a valid collocation, only a few of them manage to suggest appropriate corrections with an acceptable hit rate. Most often, a ranked list of correction options is offered, from which the learner then has to choose. This is clearly unsatisfactory. Our proposal focuses on this critical part of the problem in the context of the acquisition of Spanish as a second language. For collocation error detection, we use a frequency-based technique. To improve collocation error correction, we discuss three different metrics with respect to their capability to select the most appropriate correction of miscollocations found in our learner corpus.
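
To make the idea of frequency-based detection concrete, the following is a minimal sketch and not the procedure used in the article: it assumes that a learner's (base, collocate) pair is flagged as a possible miscollocation when it occurs fewer than a chosen number of times in a reference corpus. The function names, the threshold `min_count`, and the toy data are all hypothetical.

```python
from collections import Counter

def count_pairs(reference_pairs):
    """Count how often each (base, collocate) pair occurs in a reference corpus."""
    return Counter(reference_pairs)

def is_suspect(pair, freq, min_count=2):
    """Flag a pair as a possible miscollocation if it is rare in the reference corpus.

    min_count is an illustrative threshold, not a value taken from the article.
    """
    return freq[pair] < min_count

# Toy reference data (hypothetical): pairs as they might be extracted from native text.
reference_pairs = [
    ("paseo", "dar"), ("paseo", "dar"), ("paseo", "dar"),
    ("decisión", "tomar"), ("decisión", "tomar"),
]
freq = count_pairs(reference_pairs)

# 'tomar un paseo' is a literal rendering of 'take a walk'; it is absent from the toy corpus.
print(is_suspect(("paseo", "tomar"), freq))
print(is_suspect(("paseo", "dar"), freq))
```

A raw count threshold is the simplest possible choice; an association measure over the same counts could be substituted without changing the overall shape of the check.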

Notes

  1. In accordance with the terminology in Second Language Learning literature, we refer to the native tongue of the learner as ‘L1’ and to her second language as ‘L2’.

  2. Liu et al. (2009) achieve a higher accuracy; however, they start from a manually compiled list of miscollocations rather than from an automatically retrieved list.

  3. CEDEL2 is an L1 English–L2 Spanish learner corpus under construction by Cristóbal Lozano in the framework of a larger corpus-oriented project directed by Amaya Mendikoetxea at the Universidad Autónoma de Madrid. Currently, CEDEL2 contains about 730,000 words of essays in Spanish on a predefined range of topics by native speakers of English and (to a smaller extent, for contrastive studies) by native speakers of Spanish. The topics include, among others, “How is the region where you live?”, “What do your plans for the future look like?”, “How did you spend your last holidays?”, and “Analyze the major aspects of immigration”. The level of Spanish of the authors of the essays ranges from “elementary” through “lower intermediate”, “intermediate”, and “advanced” to “very advanced”. Further information on CEDEL2 can be obtained from http://www.uam.es/proyectoinv/woslac/cedel2.htm and from Lozano (2009) and Lozano and Mendikoetxea (2013).

  4. See, e.g., (Atwell 1987; Knight and Chander 1994; Hermet et al. 2008; Meurers 2013).

  5. Recall that we use the term collocation in the sense of idiosyncratic binary lexical co-occurrences, i.e., following the lexicographic tradition (Hausmann 1989; Cowie 1994; Mel’čuk 1995).

  6. Automatic classification of miscollocations encountered in essays with respect to this typology is another big challenge, which remains to be tackled.

  7. This annotation schema is currently used to annotate a fragment of CEDEL2.

  8. The details of the filtering stage and the size of the correction list on which they calculate the reported MRR are not explicitly discussed in (Wu et al. 2010). However, we can deduce both from experiments with the MUST Collocation checker (http://miscollocation.appspot.com), which is based on their proposal.

  9. Roughly speaking, members of the same “collocation cluster” are values of the same lexical function in the sense of Mel’čuk (1995).

  10. As rightly pointed out by one of the reviewers, apart from graphic similarity, phonetic similarity should also be considered. A large number of phonetic distance measures are available; see (Kessler 2005) for an in-depth discussion, starting with the implementation of Russell and Odell’s Soundex for English.

  11. Obviously, this strategy carries the risk of sparse contextual features if the learner uses a (mis)collocation in very idiosyncratic contexts.

  12. In contrast to information retrieval-oriented search, we do not eliminate from the context the functional words (which are otherwise considered to be “stop words” that do not contribute to the quality of the search) since they are essential for our task.

  13. As a matter of fact, proper names are poor features for our task. We plan to discard them in future experiments.

  14. The suggestion *agenciar [una] cita as a possible candidate is due to the wrong PoS tagging of the bigram agencia cita ‘agency cites’, which is very common in a newspaper corpus such as ours.

  15. This suggests that a corpus of newspaper material is not well balanced for the purposes of collocation-oriented CALL.

  16. However, it is a standard collocation in Argentinian Spanish.

  17. The context feature metric thus considered the same “features” as the lexical context metric, except that it interpreted them differently.

  18. As already mentioned above, MUST is an implementation of (Wu et al. 2010).

  19. Instead of grow up mind, we introduced grow mind since MUST does not process collocations with phrasal verbs.

  20. Since MUST’s corrections for grow mind did not include any correct suggestion, we added cultivate mind (the correction suggested by Li) to the bag of suggestions for grow mind.
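
Note 8 refers to the mean reciprocal rank (MRR) computed over a list of correction suggestions. As a reminder of how this measure is defined, here is a minimal sketch of the standard calculation; the function name and the toy data are hypothetical and are not taken from (Wu et al. 2010) or from our experiments.

```python
def mean_reciprocal_rank(ranked_suggestions, gold):
    """Average of 1/rank of the first correct suggestion per miscollocation.

    A miscollocation whose suggestion list contains no correct item contributes 0.
    """
    if not ranked_suggestions:
        return 0.0
    total = 0.0
    for suggestions, correct in zip(ranked_suggestions, gold):
        for rank, suggestion in enumerate(suggestions, start=1):
            if suggestion in correct:
                total += 1.0 / rank
                break
    return total / len(ranked_suggestions)

# Toy data (hypothetical): one ranked suggestion list per miscollocation,
# and the set of acceptable corrections for each.
ranked_suggestions = [
    ["dar paseo", "hacer paseo"],            # correct item at rank 1
    ["hacer decisión", "tomar decisión"],    # correct item at rank 2
]
gold = [{"dar paseo"}, {"tomar decisión"}]

print(mean_reciprocal_rank(ranked_suggestions, gold))
```

Because the reciprocal rank depends on how long the suggestion list is allowed to be, the size of the correction list matters when MRR values from different systems are compared.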

References

  • Alonso Ramos, M., Wanner, L., Vázquez, N., Vincze, O., Mosqueira, E., & Prieto, S. (2010a). Tagging collocations for learners. In S. Granger & M. Paquot (Eds.), eLexicography in the 21st century: New challenges, new applications. Proceedings of eLex 2009, Cahiers du Cental, volume 7, Louvain-la-Neuve.

  • Alonso Ramos, M., Wanner, L., Vincze, O., Casamayor, G., Vázquez, N., Mosqueira, E., & Prieto, S. (2010b). Towards a motivated annotation schema of collocation errors in learner corpora. In Proceedings of LREC 2010, Malta.

  • Atwell, E. (1987). How to detect grammatical errors in a text without parsing it. In Proceedings of the EACL Conference (pp. 38–45). Copenhagen, Denmark.

  • Bouma, G. (2010). Collocation extraction beyond the independence assumption. In Proceedings of the ACL Conference, Short paper track, Uppsala.

  • Chang, Y. C., Chang, J. S., Chen, H. J., & Liou, H. C. (2008). An automatic collocation writing assistant for Taiwanese EFL learners. A case of corpus-based NLP technology. Computer Assisted Language Learning, 21(3), 283–299.

  • Chen, H. (2009). Microsoft ESL assistant and NTNU statistical grammar checker. Computational Linguistics and Chinese Language Processing, 14(2), 161–180.

  • Choueka, Y. (1988). Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO (pp. 34–38).

  • Church, K. W., & Hanks, P. (1989). Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the ACL (pp. 76–83).

  • Cowie, A. P. (1994). Phraseology. In R. E. Asher & J. Simpson (Eds.), The encyclopedia of language and linguistics (Vol. 6, pp. 3168–3171). Oxford: Pergamon.

  • Dahlmeier, D., & Ng, H. T. (2011). Correcting semantic collocation errors with L1-induced paraphrases. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 107–117). Edinburgh, Scotland.

  • Evert, S. (2007). Corpora and collocations. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An international handbook. Berlin: Mouton de Gruyter.

  • Evert, S., & Kermes, H. (2003). Experiments on candidate data for collocation extraction. In Companion Volume to the Proceedings of the 10th Conference of the EACL (pp. 83–86).

  • Futagi, Y., Deane, P., Chodorow, M., & Tetreault, J. (2008). A computational approach to detecting collocation errors in the writing of non-native speakers of English. Computer Assisted Language Learning, 21(1), 353–367.

  • Gamon, M., Leacock, C., Brockett, C., Dolan, W., Gao, J., & Belenko, D. (2009). Using statistical techniques and web search to correct ESL errors. CALICO Journal, 26(3), 491–511.

  • Gilquin, G. (2007). To err is not all. What corpus and elicitation can reveal about the use of collocations by learners. Zeitschrift für Anglistik und Amerikanistik, 55(3), 273–291.

  • Granger, S. (1998). Prefabricated patterns in advanced EFL writing: Collocations and formulae. In A. Cowie (Ed.), Phraseology: Theory, analysis and applications (pp. 145–160). Oxford: Oxford University Press.

  • Hausmann, F.-J. (1984). Wortschatzlernen ist Kollokationslernen. Zum Lehren und Lernen französischer Wortwendungen. Praxis des neusprachlichen Unterrichts, 31(4), 395–406.

  • Hausmann, F.-J. (1989). Le dictionnaire de collocations. In F.-J. Hausmann, O. Reichmann, H. E. Wiegand, & L. Zgusta (Eds.), Wörterbücher, dictionaries, dictionnaires. Ein internationales Handbuch. Berlin: De Gruyter.

  • Hermet, M., Désilets, A., & Szpakowicz, S. (2008). Using the web as a linguistic resource to automatically correct lexico-syntactic errors. In Proceedings of the LREC 2008 (pp. 54–57), Marrakech.

  • Howarth, P. (1998a). Phraseology and second language acquisition. Applied Linguistics, 19(1), 24–44.

  • Howarth, P. (1998b). The phraseology of learner’s academic writing. In A. P. Cowie (Ed.), Phraseology: Theory, analysis and applications (pp. 161–186). Oxford: Oxford University Press.

  • Kessler, B. (2005). Phonetic comparison algorithms. Transactions of the Philological Society, 103(2), 243–260.

  • Kilgarriff, A. (2006). Collocationality (and how to measure it). In Proceedings of the 12th EURALEX International Congress, Torino.

  • Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of the ACL Conference (pp. 423–430).

  • Knight, K., & Chander, I. (1994). Automated postediting of documents. In Proceedings of the AAAI Conference (pp. 779–784) Seattle, WA.

  • Lesniewska, J. (2006). Collocations and second language use. Studia Linguistica Universitatis Iagellonicae Cracoviensis, 123, 95–105.

  • Lewis, M. (2000). Teaching collocation. Further developments in the lexical approach. London: LTP.

  • Li, C. C. (2005). A Study of collocational error types in ESL/EFL College learners. Ph.D. thesis, Ming Chuan University College of Applied Languages, Department of Applied English.

  • Liu, A. L.-E., Wible, D., & Tsao, N.-L. (2009). Automated suggestions for miscollocations. In Proceedings of the NAACL HLT Workshop on Innovative Use of NLP for Building Educational Applications (pp. 47–50). Boulder, CO.

  • Lozano, C. (2009). CEDEL2: Corpus escrito del español L2. In C. M. Bretones Callejas (Ed.), Applied linguistics now: Understanding language and mind (pp. 197–212). Almería: Universidad de Almería.

  • Lozano, C., & Mendikoetxea, A. (2013). Learner corpora and second language acquisition: The design and collection of CEDEL2. In A. Díaz-Negrillo, N. Ballier, & P. Thompson (Eds.), Automatic treatment and analysis of learner corpus data. Amsterdam: Benjamins Academic Publishers.

  • Mel’čuk, I. A. (1995). Phrasemes in language and phraseology in linguistics. In M. Everaert, E.-J. van der Linden, A. Schenk & R. Schreuder (Eds.), Idioms: Structural and psychological perspectives (pp. 167–232). Hillsdale: Lawrence Erlbaum Associates.

  • Meurers, D. (2013). Natural language processing and language learning. In C. A. Chapelle (Ed.), Encyclopedia of applied linguistics (pp. 1–13). Hoboken: Blackwell.

  • Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.

  • Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics, 24(2), 223–242.

  • Nesselhauf, N. (2005). Collocations in a learner corpus. Amsterdam: Benjamins Academic Publishers.

  • Pantel, P., & Lin, D. (2000). Word-for-word glossing with contextually similar words. In Proceedings of 4th NAACL Conference (pp. 78–85). Seattle.

  • Park, T., Lank, E., Poupart, P., & Terry, M. (2008). Is the sky pure today? AwkChecker: An assistive tool for detecting and correcting errors. In Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology (UIST ’08), New York.

  • Pecina, P. (2008). A machine learning approach to multiword expression extraction. In Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions (MWE 2008) (pp. 54–57), Marrakech.

  • Shei, C. C., & Pain, H. (2000). An ESL writer’s collocation aid. Computer Assisted Language Learning, 13(2), 167–182.

  • Smadja, F. (1993). Retrieving collocations from text: X-Tract. Computational Linguistics, 19(1), 143–177.

  • Vossen, P. (1998). EuroWordNet: A multilingual database with lexical semantic networks. Dordrecht: Kluwer Academic Publishers.

  • Wanner, L., Bohnet, B., & Giereth, M. (2006). Making sense of collocations. Computer Speech and Language, 20(4), 609–624.

  • Wible, D., Kuo, C.-H., Tsao, N.-L., Liu, A. L.-E., & Lin, H.-L. (2003). Bootstrapping in a language learning environment. Journal of Computer Assisted Learning, 19(4), 90–102.

  • Wible, D., & Tsao, N. L. (2010). Stringnet as a computational resource for discovering and investigating linguistic constructions. In Proceedings of the NAACL-HLT Workshop on Extracting and Using Constructions in Computational Linguistics, Los Angeles.

  • Wu, J.-C., Chang, Y.-C., Mitamura, T., & Chang, J. S. (2010). Automatic collocation suggestion in academic writing. In Proceedings of the ACL Conference, Short paper track, Uppsala.

  • Yin, X., Gao, J., & Dolan, W. (2008). A web-based English proofing system for English as a second language users. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (pp. 619–624). Hyderabad, India.

Acknowledgments

Many thanks to Amaya Mendikoetxea and Cristóbal Lozano for making the CEDEL2 corpus available to us and to the two anonymous reviewers for their insightful comments, which considerably improved the final version of the paper. Our experiments have been partially run on the Argo cluster of the Department of Communication and Information Technologies, UPF. We are grateful for this service and would especially like to thank Silvina Re and Iván Jiménez for their help. This work has been partially funded by the Spanish Ministry of Science and Innovation under the contract numbers FFI2008-06479-C02-01/02 and FFI2011-30219-C02-01/02.

Author information

Corresponding author

Correspondence to Leo Wanner.

About this article

Cite this article

Ferraro, G., Nazar, R., Alonso Ramos, M. et al. Towards advanced collocation error correction in Spanish learner corpora. Lang Resources & Evaluation 48, 45–64 (2014). https://doi.org/10.1007/s10579-013-9242-3

  • DOI: https://doi.org/10.1007/s10579-013-9242-3
