Analysing Keyword Lists

A Practical Handbook of Corpus Linguistics

Abstract

Frequency lists are useful in their own right for helping a linguist, lexicographer, language teacher, or learner analyse or exploit a corpus. When frequency lists are compared through the keywords approach, significant changes in the relative ordering of words can flag points of interest. This conceptually simple approach of comparing one frequency list against another has been very widely exploited in corpus linguistics to help answer a vast number of research questions. In this chapter, we describe, step by step, how to produce a keywords list, and then highlight two representative studies to illustrate the usefulness of the method. In our critical assessment of the keywords method, we highlight issues related to corpus design and comparability, the application of statistics, and the use of clusters and n-grams to improve the method. We also describe important software tools and other resources, and suggest further reading.

Notes

  1. Online calculators and downloadable spreadsheets are available at http://corpora.lancs.ac.uk/sigtest/ (accessed 25 June 2019) and http://ucrel.lancs.ac.uk/llwizard.html (accessed 25 June 2019).

  2. This formula is the 2-cell calculation (Rayson and Garside 2000), which can be used because the contribution from the other two cells is fairly constant and does not affect the ranking order. Other tools (e.g. AntConc) and statistical calculators also support the 4-cell calculation, which incorporates the frequencies of the other words into the Log-Likelihood value.
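
The difference between the 2-cell and 4-cell calculations mentioned in note 2 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular tool. It assumes the usual formulation in which a and b are the frequencies of a word in the two corpora, c and d are the corpus sizes, and the expected values are E1 = c(a+b)/(c+d) and E2 = d(a+b)/(c+d). The function name, parameter names, and toy counts are all invented for illustration.

```python
import math


def _term(observed, expected):
    """One cell's contribution, observed * ln(observed / expected).

    By convention, a cell with an observed count of zero contributes nothing.
    """
    if observed <= 0:
        return 0.0
    return observed * math.log(observed / expected)


def log_likelihood(a, b, c, d, four_cell=False):
    """Log-likelihood for one word across two corpora (hypothetical helper).

    a, b: frequency of the word in corpus 1 and corpus 2
    c, d: total number of words in corpus 1 and corpus 2
    four_cell: if True, also add the two cells for all other words
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    total = _term(a, e1) + _term(b, e2)
    if four_cell:
        # Cells for "every other word" in each corpus.
        total += _term(c - a, c - e1) + _term(d - b, d - e2)
    return 2 * total


if __name__ == "__main__":
    # Invented toy counts: word -> (frequency in corpus 1, frequency in corpus 2)
    counts = {"corpus": (120, 30), "the": (5000, 5100), "keyword": (40, 2)}
    c, d = 100_000, 100_000  # invented corpus sizes
    for word, (a, b) in counts.items():
        two = log_likelihood(a, b, c, d)
        four = log_likelihood(a, b, c, d, four_cell=True)
        print(f"{word}: 2-cell = {two:.2f}, 4-cell = {four:.2f}")
```

Sorting the words by either value gives a candidate keyword list. The 4-cell value is never smaller than the 2-cell value, because the two extra cells together contribute a non-negative amount.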

References

  • Baker, P. (2004). Querying keywords: Questions of difference, frequency, and sense in keywords analysis. Journal of English Linguistics, 32(4).

  • Baker, P. (2017). British and American English: Divided by a common language? Cambridge: Cambridge University Press.

  • Baker, P., Gabrielatos, C., & McEnery, T. (2013). Discourse analysis and media attitudes: The representation of Islam in the British Press. Cambridge: Cambridge University Press.

  • Baron, A., Rayson, P., & Archer, D. (2009). Word frequency and key word statistics in corpus linguistics. Anglistik, 20(1), 41–67.

  • Boneva, B., & Kraut, R. (2002). Email, gender, and personal relations. In B. Wellman & C. Haythornthwaite (Eds.), The internet in everyday life (pp. 372–403). Oxford: Blackwell.

  • Brezina, V., & Meyerhoff, M. (2014). Significant or random? A critical review of sociolinguistic generalisations based on large corpora. International Journal of Corpus Linguistics, 19(1), 1–28.

  • Cressie, N., & Read, T. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B: Methodological, 46(3), 440–464.

  • Crossley, S. A., Defore, C., Kyle, K., Dai, J., & McNamara, D. S. (2013). Paragraph specific n-gram approaches to automatically assessing essay quality. In S. K. D’Mello, R. A. Calvo, & A. Olney (Eds.), Proceedings of the 6th international conference on educational data mining (pp. 216–219). Heidelberg/Berlin: Springer.

  • Culpeper, J. (2009). Words, parts-of-speech and semantic categories in the character-talk of Shakespeare’s Romeo and Juliet. International Journal of Corpus Linguistics, 14(1), 29–59.

  • Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.

  • Egbert, J., & Biber, D. (2019). Incorporating text dispersion into keyword analysis. Corpora, 14(1), 77–104.

  • Gries, S. T. (2005). Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory, 1(2), 277–294.

  • Hardie, A. (2014). Log Ratio – an informal introduction. CASS blog: http://cass.lancs.ac.uk/?p=1133. Accessed 25 June 2019.

  • Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. Bergen, Norway: The Norwegian Computing Centre for the Humanities.

  • Juilland, A., Brodin, D., & Davidovitch, C. (1970). Frequency dictionary of French words. Paris: Mouton &.

  • Kilgarriff, A. (1996). Why chi-square doesn’t work, and an improved LOB-Brown comparison. In Proceedings of the ALLC-ACH conference (pp. 169–172). Bergen, Norway.

  • Kilgarriff, A. (2005). Language is never ever ever random. Corpus Linguistics and Linguistic Theory, 1(2), 263–276.

  • Kyle, K., Crossley, S., Dai, J., & McNamara, D. (2013, June 13). Native language identification: A key N-gram category approach. In Proceedings of the eighth workshop on innovative use of NLP for building educational applications (pp. 242–250). Atlanta, Georgia.

  • Lijffijt, J., Nevalainen, T., Säily, T., Papapetrou, P., Puolamäki, K., & Mannila, H. (2016). Significance testing of word frequencies in corpora. Literary and Linguistic Computing, 31(2), 374–397.

  • Mahlberg, M. (2008). Clusters, key clusters and local textual functions in Dickens. Corpora, 2(1), 1–31.

  • Murphy, B. (2010). Corpus and sociolinguistics: Investigating age and gender in female talk. Amsterdam: John Benjamins.

  • Paquot, M. (2013). Lexical bundles and transfer effects. International Journal of Corpus Linguistics, 18(3), 391–417.

  • Paquot, M. (2014). Cross-linguistic influence and formulaic language: Recurrent word sequences in French learner writing. In L. Roberts, I. Vedder, & J. Hulstijn (Eds.), EUROSLA yearbook (pp. 216–237). Amsterdam: Benjamins.

  • Paquot, M. (2017). L1 frequency in foreign language acquisition: Recurrent word combinations in French and Spanish EFL learner writing. Second Language Research, 33(1), 13–32.

  • Paquot, M., & Bestgen, Y. (2009). Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. In A. Jucker, D. Schreier, & M. Hundt (Eds.), Corpora: Pragmatics and discourse (pp. 247–269). Amsterdam: Rodopi.

  • Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus Linguistics, 13(4), 519–549.

  • Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the workshop on Comparing Corpora, held in conjunction with the 38th annual meeting of the Association for Computational Linguistics (ACL 2000), 1–8 October 2000, Hong Kong (pp. 1–6).

  • Rayson, P., & Wilson, A. (1996). The ACAMRIT semantic tagging system: Progress report. In L. J. Evett & T. G. Rose (Eds.), Language engineering for document analysis and recognition, LEDAR, AISB96 workshop proceedings (pp. 13–20). Brighton: Faculty of Engineering and Computing, Nottingham Trent University, UK.

  • Rayson, P., Berridge, D., & Francis, B. (2004a, March 10–12). Extending the Cochran rule for the comparison of word frequencies between corpora. In G. Purnelle, C. Fairon, & A. Dister (Eds.), Le poids des mots: Proceedings of the 7th International Conference on Statistical analysis of textual data (JADT 2004) (Vol. II, pp. 926–936). Louvain-la-Neuve: Presses Universitaires de Louvain.

  • Rayson, P., Archer, D., Piao, S. L., & McEnery, T. (2004b). The UCREL semantic analysis system. In Proceedings of the workshop on beyond named entity recognition semantic labelling for NLP tasks in association with 4th international conference on language resources and evaluation (LREC 2004), 7–12. 25th May 2004, Lisbon, Portugal. Paris: European Language Resources Association.

  • Scott, M. (1997). PC analysis of key words – And key key words. System, 25(2), 233–245.

  • Scott, M. (2004). WordSmith tools. Version 4.0. Oxford: Oxford University Press. ISBN: 0-19-459400-9.

  • Seale, C., Ziebland, S., & Charteris-Black, J. (2006). Gender, cancer experience and internet use: A comparative keyword analysis of interviews and online cancer support groups. Social Science & Medicine, 62, 2577–2590.

  • Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.

  • Tono, Y., Yamazaki, M., & Maekawa, K. (2013). A frequency dictionary of Japanese. Routledge.

  • Vasishth, S., & Nicenboim, B. (2016). Statistical methods for linguistic research: Foundational ideas – Part I. Lang & Ling Compass, 10, 349–369. https://doi.org/10.1111/lnc3.12201.

  • Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133.

  • Wilson, A. (2013). Embracing Bayes factors for key item analysis in corpus linguistics. In New approaches to the study of linguistic variability. Language competence and language awareness in Europe (pp. 3–11). Frankfurt: Peter Lang.

  • Wilson, A., & Rayson, P. (1993). Automatic content analysis of spoken discourse. In C. Souter & E. Atwell (Eds.), Corpus based computational linguistics (pp. 215–226). Amsterdam: Rodopi.

Author information

Corresponding author

Correspondence to Paul Rayson.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Rayson, P., Potts, A. (2020). Analysing Keyword Lists. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_6
