Multi-word Expressions: A Novel Computational Approach to Their Bottom-Up Statistical Extraction

  • Alexander Wahl
  • Stefan Th. Gries
Part of the Quantitative Methods in the Humanities and Social Sciences book series (QMHSS)


In this paper, we introduce and validate a new bottom-up approach to the identification and extraction of multi-word expressions (MWEs) in corpora. This approach, called Multi-word Expressions from the Recursive Grouping of Elements (MERGE), successively combines bigrams to form word sequences of various lengths. The bigrams to be “merged” are selected using a lexical association measure, log likelihood (Dunning, Computational Linguistics 19:61–74, 1993). We apply the algorithm to two corpora and test its performance both on its own merits and against a competing algorithm from the literature, the adjusted frequency list (O’Donnell, ICAME Journal 35:135–169, 2011). The performance of both algorithms is evaluated via human ratings of the MWE candidates that they generate. Ultimately, MERGE is shown to offer a very competitive approach to MWE extraction.
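
The core idea of the abstract, scoring bigrams with log likelihood and repeatedly fusing the best one into a single unit, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the restriction to adjacent (gap-free) bigrams, the filter for positively associated pairs, and the fixed iteration count are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(o11, o12, o21, o22):
    """Dunning's G^2 for a 2x2 table of observed co-occurrence counts."""
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    total = 0.0
    for obs, row, col in ((o11, row1, col1), (o12, row1, col2),
                          (o21, row2, col1), (o22, row2, col2)):
        if obs > 0:  # a zero cell contributes nothing to the sum
            expected = row * col / n
            total += obs * math.log(obs / expected)
    return 2.0 * total


def best_bigram(tokens):
    """Return the adjacent pair with the highest G^2, and that score.

    Assumption: only pairs observed more often than expected are eligible.
    """
    pairs = Counter(zip(tokens, tokens[1:]))
    left = Counter(tokens[:-1])   # how often each token opens a bigram
    right = Counter(tokens[1:])   # how often each token closes a bigram
    n = len(tokens) - 1           # number of bigram positions
    best, best_score = None, 0.0
    for (a, b), o11 in pairs.items():
        if o11 * n <= left[a] * right[b]:
            continue  # observed <= expected: not positively associated
        o12 = left[a] - o11
        o21 = right[b] - o11
        o22 = n - o11 - o12 - o21
        score = log_likelihood(o11, o12, o21, o22)
        if score > best_score:
            best, best_score = (a, b), score
    return best, best_score


def merge_once(tokens, pair):
    """Fuse every non-overlapping occurrence of `pair` into one token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def merge(tokens, iterations):
    """Run a fixed number of merge steps (hypothetical stopping rule)."""
    merged = []
    for _ in range(iterations):
        pair, _score = best_bigram(tokens)
        if pair is None:
            break
        tokens = merge_once(tokens, pair)
        merged.append(" ".join(pair))
    return tokens, merged
```

Because a fused pair re-enters the token stream as a single element, later iterations can combine it with its neighbours, which is how sequences longer than two words would emerge under these assumptions. For example, `merge("in spite of x in spite of y in spite of z".split(), 2)` fuses the three words of "in spite of" into one token in two steps.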


  1. Bannard, C., & Matthews, D. (2008). Stored word sequences in language learning: The effect of familiarity on children’s repetition of four-word combinations. Psychological Science, 19, 241–248.
  2. Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.
  3. Barton, K. (2015). MuMIn: Multi-model inference. R package version 1.13.4.
  4. Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
  5. Biber, D., Conrad, S., & Cortes, V. (2004). If you look at …: Lexical bundles in university teaching and textbooks. Applied Linguistics, 25(3), 371–405.
  6. Bod, R. (2009). From exemplar to grammar: A probabilistic analogy-based model of language learning. Cognitive Science, 33(5), 752–793.
  7. Conklin, K., & Schmitt, N. (2012). The processing of formulaic language. Annual Review of Applied Linguistics, 32(1), 45–61.
  8. Constant, M., Eryigit, G., Monti, J., van der Plas, L., Ramisch, C., Rosner, M., & Todirascu, A. (2017). Multiword expression processing: A survey. Computational Linguistics, 43(4), 837–892.
  9. da Silva, J. F., Dias, G., Guilloré, S., & Pereira Lopes, J. G. (1999). Using LocalMax algorithm for the extraction of contiguous and non-contiguous multiword lexical units. Proceedings of the 9th Portuguese Conference on Artificial Intelligence: Progress in Artificial Intelligence, 849–849.
  10. Daudaravičius, V., & Marcinkevičienė, R. (2004). Gravity counts for the boundaries of collocations. International Journal of Corpus Linguistics, 9(2), 321–348.
  11. Du Bois, J. W., & Englebretson, R. (2004). Santa Barbara corpus of spoken American English, part 3. Philadelphia: Linguistic Data Consortium.
  12. Du Bois, J. W., Chafe, W. L., Meyers, C., Thompson, S. A., & Martey, N. (2003). Santa Barbara corpus of spoken American English, part 2. Philadelphia: Linguistic Data Consortium.
  13. Du Bois, J. W., & Englebretson, R. (2005). Santa Barbara corpus of spoken American English, part 4. Philadelphia: Linguistic Data Consortium.
  14. Du Bois, J. W., Chafe, W. L., Meyers, C., & Thompson, S. A. (2000). Santa Barbara corpus of spoken American English, part 1. Philadelphia: Linguistic Data Consortium.
  15. Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
  16. Erman, B., & Warren, B. (2000). The idiom principle and the open choice principle. Text, 20(1), 29–62.
  17. Evert, S. (2005). The statistics of word co-occurrences: Word pairs and collocations. Ph.D. dissertation, Universität Stuttgart.
  18. Evert, S. (2009). Corpora and collocations. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: An international handbook (Vol. 2, pp. 1212–1248). Berlin & New York: Mouton de Gruyter.
  19. Foster, P. (2001). Rules and routines: A consideration of their role in the task-based language production of native and non-native speakers. In M. Bygate, P. Skehan, & M. Swain (Eds.), Researching pedagogic tasks: Second language learning, teaching, and testing (pp. 75–93). Harlow: Longman.
  20. Green, S., de Marneffe, M.-C., Bauer, J., & Manning, C. D. (2013). Parsing models for identifying multiword expressions. Computational Linguistics, 39(1), 195–227.
  21. Gries, S. T. (2008). Phraseology and linguistic theory: A brief survey. In S. Granger & F. Meunier (Eds.), Phraseology: An interdisciplinary perspective (pp. 3–25). Amsterdam: John Benjamins.
  22. Gries, S. T. (2010). Useful statistics for corpus linguistics. In A. Sánchez & M. Almela (Eds.), A mosaic of corpus linguistics: Selected approaches (pp. 269–291). Frankfurt am Main: Peter Lang.
  23. Gries, S. T. (2012). Frequencies, probabilities, association measures in usage-/exemplar-based linguistics: Some necessary clarifications. Studies in Language, 36(3), 477–510.
  24. Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next …. International Journal of Corpus Linguistics, 18(1), 137–165.
  25. Gries, S. T., & Mukherjee, J. (2010). Lexical gravity across varieties of English: An ICE-based study of n-grams in Asian Englishes. International Journal of Corpus Linguistics, 15(4), 520–548.
  26. Ikehara, S., Shirai, S., & Uchino, H. (1996). A statistical method for extracting uninterrupted and interrupted collocations from very large corpora. Proceedings of the 16th Conference on Computational Linguistics, 1, 574–579.
  27. Johnson, P. C. D. (2014). Extension of Nakagawa and Schielzeth’s R²GLMM to random slopes models. Methods in Ecology and Evolution, 5(9), 944–946.
  28. Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2016). lmerTest: Tests in linear mixed effects models. R package version 2.0-30.
  29. Lareau, F., Dras, M., Börschinger, B., & Dale, R. (2011). Collocations in multilingual natural language generation: Lexical functions meet lexical functional grammar. In Proceedings of ALTA’11 (pp. 95–104).
  30. McEnery, T. (2006). Swearing in English: Bad language, purity and power from 1586 to the present. Abingdon & New York: Routledge.
  31. Nagao, M., & Mori, S. (1994). A new method of n-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese. In Proceedings of the 15th Conference on Computational Linguistics (pp. 611–615).
  32. Nakagawa, S., & Schielzeth, H. (2010). Repeatability for Gaussian and non-Gaussian data: A practical guide for biologists. Biological Reviews, 85(4), 935–956.
  33. Newman, J., & Columbus, G. (2010). The International Corpus of English – Canada. Edmonton, Alberta: University of Alberta.
  34. O’Donnell, M. B. (2011). The adjusted frequency list: A method to produce cluster-sensitive frequency lists. ICAME Journal, 35, 135–169.
  35. Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. Richards & R. Schmidt (Eds.), Language and communication (pp. 191–225). London: Longman.
  36. Pecina, P. (2009). Lexical association measures: Collocation extraction. Prague: Charles University.
  37. R Core Team. (2016). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.
  38. Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (pp. 1–15). Mexico City.
  39. Simpson-Vlach, R., & Ellis, N. (2010). An academic formulas list. Applied Linguistics, 31(4), 487–512.
  40. Sinclair, J. (1987). Collins COBUILD English language dictionary. Ann Arbor: Collins.
  41. Siyanova-Chanturia, A., Conklin, K., & Schmitt, N. (2011). Adding more fuel to the fire: An eye-tracking study of idiom processing by native and non-native speakers. Second Language Research, 27(2), 251–272.
  42. Wahl, A. (2015). Intonation unit boundaries and the storage of bigrams: Evidence from bidirectional and directional association measures. Review of Cognitive Linguistics, 13(1), 191–219.
  43. Wible, D., Kuo, C.-H., Chen, M.-C., Tsao, N.-L., & Hung, T.-F. (2006). A computational approach to the discovery and representation of lexical chunks. Paper presented at TALN 2006. Leuven.
  44. Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Radboud University, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands
  2. University of California, Santa Barbara, Santa Barbara, USA
  3. Justus Liebig University Giessen, Giessen, Germany
