Analyzing Co-occurrence Data

A Practical Handbook of Corpus Linguistics

Abstract

In this chapter, we provide an overview of quantitative approaches to co-occurrence data. We begin with a brief terminological overview of the types of co-occurrence that are prominent in corpus-linguistic studies and then discuss the computation of some widely used association measures for quantifying co-occurrence. We present two representative case studies, one exploring lexical collocation and learner proficiency, the other exploring creative uses of verbs with argument structure constructions. In addition, we show that most widely used measures fall out from viewing corpus-linguistic association as an instance of regression modeling. We also discuss newer developments and potential improvements in association-measure research, such as using directional measures of association, not uncritically conflating frequency and association-strength information in association measures, and incorporating type frequencies and entropies.
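As a rough illustration of what a directional measure of association looks like, the following minimal sketch computes ΔP in both directions from a 2×2 co-occurrence table. The function name `delta_p`, the cell labels `a` to `d`, and all counts are hypothetical and chosen only for illustration; they are not taken from the chapter.

```python
def delta_p(a, b, c, d):
    """Return (dP(w2 | w1), dP(w1 | w2)) for a 2x2 co-occurrence table.

    a: w1 and w2 co-occur        b: w1 without w2
    c: w2 without w1             d: neither w1 nor w2
    Assumes every row and column total is non-zero.
    """
    w2_given_w1 = a / (a + b) - c / (c + d)  # how much w1 raises the chance of w2
    w1_given_w2 = a / (a + c) - b / (b + d)  # how much w2 raises the chance of w1
    return w2_given_w1, w1_given_w2


# Hypothetical counts for a pair such as "of course" (w1 = "of", w2 = "course").
course_given_of, of_given_course = delta_p(5_000, 995_000, 1_000, 8_999_000)
print(f"dP(course | of) = {course_given_of:.4f}")
print(f"dP(of | course) = {of_given_course:.4f}")
```

With these made-up counts, the second word predicts the first far better than the reverse, which is the kind of asymmetry a single symmetric score cannot express.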

Notes

  1. We are ignoring the lexico-textual co-occurrence sense of colligation here.

  2. While association measures (AMs) often agree fairly well in their assessment of the degree of attraction between two elements (or at least in their overall ranking), the way they are computed gives them different ‘preferences’. For instance, pointwise MI is known to return low-frequency but perfectly predictive collocations (e.g. fixed expressions), whereas measures that are ultimately based on significance tests (such as G² or t) often rank more frequent items higher; see Evert (2009) for more discussion.

  3. The Kullback-Leibler divergence is also already mentioned in Pecina (2010).

  4. See Michelbacher et al. (2007, 2011) and Gries (2013a) for further explorations of uni-directional/asymmetric measures.
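The different ‘preferences’ described in note 2 can be made concrete with a small sketch. It computes pointwise MI and the log-likelihood ratio G² from 2×2 tables for two invented word pairs: one rare but perfectly predictive, one frequent but only moderately attracted. The function names, the variable names `rare` and `frequent`, and all counts are assumptions made for illustration, not data from the chapter.

```python
import math


def expected(a, b, c, d):
    """Expected cell frequencies of a 2x2 table under independence."""
    n = a + b + c + d
    r1, r2 = a + b, c + d  # row totals
    c1, c2 = a + c, b + d  # column totals
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)


def pointwise_mi(a, b, c, d):
    """log2(observed / expected) for the co-occurrence cell a."""
    return math.log2(a / expected(a, b, c, d)[0])


def g_squared(a, b, c, d):
    """G2 = 2 * sum(O * ln(O / E)); cells with O = 0 contribute nothing."""
    obs = (a, b, c, d)
    exp = expected(a, b, c, d)
    return 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)


# Hypothetical tables (a, b, c, d) from a corpus of 1,000,000 word pairs.
rare = (5, 0, 0, 999_995)                 # 5 occurrences, always together
frequent = (2_000, 8_000, 18_000, 972_000)  # often together, often apart

for label, table in (("rare", rare), ("frequent", frequent)):
    print(f"{label}: MI = {pointwise_mi(*table):.2f}, G2 = {g_squared(*table):.2f}")
```

With these invented counts, pointwise MI ranks the rare pair higher, while G² ranks the frequent pair higher, mirroring the contrast drawn in the note.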

References

  • Ackermann, K., & Chen, Y. H. (2013). Developing the academic collocation list (ACL) – A corpus-driven and expert-judged approach. Journal of English for Academic Purposes, 12(4), 235–247.

  • Baayen, R. H. (2011). Corpus linguistics and naive discriminative learning. Brazilian Journal of Applied Linguistics, 11(2), 295–328.

  • Bartsch, S. (2004). Structural and functional properties of collocations in English. Tübingen: NARR.

  • Bestgen, Y., & Granger, S. (2014). Quantifying the development of phraseological competence in L2 English writing: An automated approach. Journal of Second Language Writing, 26(4), 28–41.

  • Daudaravičius, V., & Marcinkevičienė, R. (2004). Gravity counts for the boundaries of collocations. International Journal of Corpus Linguistics, 9(2), 321–348.

  • Durrant, P. (2014). Corpus frequency and second language learners’ knowledge of collocations. International Journal of Corpus Linguistics, 19(4), 443–477.

  • Durrant, P., & Schmitt, N. (2009). To what extent do native and non-native writers make use of collocations? International Review of Applied Linguistics, 47(2), 157–177.

  • Ellis, N. C. (2007). Language acquisition as rational contingency learning. Applied Linguistics, 27(1), 1–24.

  • Ellis, N. C., Simpson-Vlach, R., & Maynard, C. (2008). Formulaic language in native and second-language speakers: Psycholinguistics, corpus linguistics, and TESOL. TESOL Quarterly, 1(3), 375–396.

  • Evert, S. (2009). Corpora and collocations. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: An international handbook (Vol. 2, pp. 1212–1248). Berlin/New York: Mouton De Gruyter.

  • Firth, J. R. (1957). A synopsis of linguistic theory 1930–55. Reprinted in F. R. Palmer (Ed.) (1968), Selected papers of J. R. Firth, 1952–1959. London: Longman.

  • Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure. Chicago: University of Chicago Press.

  • Goldberg, A. E., Casenhiser, D. M., & Sethuraman, N. (2004). Learning argument structure generalizations. Cognitive Linguistics, 15(3), 289–316.

  • Gries, S. Th. (2008a). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437.

  • Gries, S. Th. (2008b). Phraseology and linguistic theory: A brief survey. In S. Granger & F. Meunier (Eds.), Phraseology: An interdisciplinary perspective (pp. 3–25). Amsterdam/Philadelphia: John Benjamins.

  • Gries, S. Th. (2012). Frequencies, probabilities, association measures in usage-/exemplar-based linguistics: Some necessary clarifications. Studies in Language, 36(3), 477–510.

  • Gries, S. Th. (2013a). 50-something years of work on collocations: What is or should be next …. International Journal of Corpus Linguistics, 18(1), 137–165.

  • Gries, S. Th. (2013b). Statistics for linguistics with R (2nd rev. & ext. ed.). Boston/New York: De Gruyter Mouton.

  • Gries, S. Th. (2015). More (old and new) misunderstandings of collostructional analysis: On Schmid & Küchenhoff (2013). Cognitive Linguistics, 26(3), 505–536.

  • Gries, S. Th. (2018). On over- and underuse in learner corpus research and multifactoriality in corpus linguistics more generally. Journal of Second Language Studies, 1(2), 276–308.

  • Gries, S. Th. (2019). 15 years of collostructions: Some long overdue additions/corrections (to/of actually all sorts of corpus-linguistics measures). International Journal of Corpus Linguistics, 24(3), 385–412.

  • Gries, S. Th., & Mukherjee, J. (2010). Lexical gravity across varieties of English: An ICE-based study of n-grams in Asian Englishes. International Journal of Corpus Linguistics, 15(4), 520–548.

  • Gries, S. Th., Hampe, B., & Schönefeld, D. (2005). Converging evidence: Bringing together experimental and corpus data on the association of verbs and constructions. Cognitive Linguistics, 16(4), 635–676.

  • Hampe, B., & Schönefeld, D. (2006). Syntactic leaps or lexical variation? – More on “Creative Syntax”. In S. T. Gries & A. Stefanowitsch (Eds.), Corpora in cognitive linguistics: Corpus-based approaches to syntax and lexis (pp. 127–157). Berlin/New York: Mouton de Gruyter.

  • Harris, Z. S. (1970). Papers in structural and transformational linguistics. Dordrecht: Reidel.

  • Lester, N. A., & Moscoso del Prado, M. F. (2016). Syntactic flexibility in the noun: Evidence from picture naming. In A. Papafragou, D. Grodner, D. Mirman, & J. C. Trueswell (Eds.), Proceedings of the 38th annual conference of the cognitive science society (pp. 2585–2590). Austin: Cognitive Science Society.

  • Linzen, T., & Jaeger, T. F. (2015). Uncertainty and expectation in sentence processing: Evidence from subcategorization distributions. Cognitive Science, 40(6), 1382–1411.

  • McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. Oxon/New York: Routledge.

  • Michelbacher, L., Evert, S., & Schütze, H. (2007). Asymmetric association measures. International Conference on Recent Advances in Natural Language Processing.

  • Michelbacher, L., Evert, S., & Schütze, H. (2011). Asymmetry in corpus-derived and human word associations. Corpus Linguistics and Linguistic Theory, 7(2), 245–276.

  • Mollin, S. (2009). Combining corpus linguistic and psychological data on word co-occurrences: Corpus collocates versus word associations. Corpus Linguistics and Linguistic Theory, 5(2), 175–200.

  • Pecina, P. (2010). Lexical association measures and collocation extraction. Language Resources and Evaluation, 44(1), 137–158.

  • Schneider, U. (to appear). Delta P as a measure of collocation strength. Corpus Linguistics and Linguistic Theory.

  • Siyanova-Chanturia, A. (2015). Collocation in beginner learner writing: A longitudinal study. System, 53(4), 148–160.

  • Stefanowitsch, A., & Gries, S. T. (2003). Collostructions: Investigating the interaction between words and constructions. International Journal of Corpus Linguistics, 8(2), 209–243.

  • Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.

Correspondence to Stefan Th. Gries.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Gries, S.T., Durrant, P. (2020). Analyzing Co-occurrence Data. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_7
