Analysing Keyword Lists

Rayson, Paul; Potts, Amanda

doi:10.1007/978-3-030-46216-1_6

Abstract

Frequency lists are useful in their own right for helping a linguist, lexicographer, language teacher, or learner to analyse or exploit a corpus. When employed comparatively through the keywords approach, significant changes in the relative ordering of words can flag points of interest. This conceptually simple approach of comparing one frequency list against another has been very widely exploited in corpus linguistics to help answer a vast number of research questions. In this chapter, we describe step by step how to produce a keywords list, and then highlight two representative studies to illustrate the usefulness of the method. In our critical assessment of the keywords method, we highlight issues related to corpus design and comparability, the application of statistics, and the use of clusters and n-grams to improve the method. We also describe important software tools and other resources, and suggest further reading.

Notes

1. Online calculators and downloadable spreadsheets are available at http://corpora.lancs.ac.uk/sigtest/ (accessed 25 June 2019) and http://ucrel.lancs.ac.uk/llwizard.html (accessed 25 June 2019).
2. Note that this formula represents the 2-cell calculation (Rayson and Garside 2000), which can be used because the contribution from the other two cells is fairly constant and does not affect the ranking order. Other tools (e.g. AntConc) and statistical calculators also support the 4-cell calculation, which incorporates contributions from the frequencies of the other words into the log-likelihood value.
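
The difference between the two calculations in Note 2 can be sketched in a few lines of Python. This is an illustration only, assuming the commonly stated form of the log-likelihood statistic (expected values derived from the two corpus sizes); the formula itself is not reproduced in this preview, and the function and parameter names below are hypothetical.

```python
import math


def _term(observed, expected):
    """Contribution of one cell: observed * ln(observed / expected).

    A cell with an observed frequency of 0 contributes nothing.
    """
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def log_likelihood(a, b, c, d, four_cell=False):
    """Log-likelihood for one word across two corpora (illustrative sketch).

    a, b: frequency of the word in corpus 1 and corpus 2
    c, d: total number of words in corpus 1 and corpus 2
    four_cell: if True, also add the two cells for "all other words"
    """
    n = c + d
    # Expected frequencies of the word if it were spread evenly by corpus size.
    e1 = c * (a + b) / n
    e2 = d * (a + b) / n
    total = _term(a, e1) + _term(b, e2)
    if four_cell:
        # Remaining tokens in each corpus and their expected counts.
        rest = n - a - b
        total += _term(c - a, c * rest / n) + _term(d - b, d * rest / n)
    return 2 * total


# Hypothetical counts: a word occurring 100 times in one 10,000-word corpus
# and 50 times in another of the same size.
ll_2cell = log_likelihood(100, 50, 10_000, 10_000)
ll_4cell = log_likelihood(100, 50, 10_000, 10_000, four_cell=True)
print(f"2-cell: {ll_2cell:.2f}  4-cell: {ll_4cell:.2f}")
```

With these invented counts the extra two cells add only a small amount to the value, which is in line with the note's point that they are fairly constant and do not change the ranking.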

References

Baker, P. (2004). Querying keywords: Questions of difference, frequency, and sense in keywords analysis. Journal of English Linguistics, 32(4).
Baker, P. (2017). British and American English: Divided by a common language? Cambridge: Cambridge University Press.
Baker, P., Gabrielatos, C., & McEnery, T. (2013). Discourse analysis and media attitudes: The representation of Islam in the British Press. Cambridge: Cambridge University Press.
Baron, A., Rayson, P., & Archer, D. (2009). Word frequency and key word statistics in corpus linguistics. Anglistik, 20(1), 41–67.
Boneva, B., & Kraut, R. (2002). Email, gender, and personal relations. In B. Wellman & C. Haythornthwaite (Eds.), The internet in everyday life (pp. 372–403). Oxford: Blackwell.
Brezina, V., & Meyerhoff, M. (2014). Significant or random? A critical review of sociolinguistic generalisations based on large corpora. International Journal of Corpus Linguistics, 19(1), 1–28.
Cressie, N., & Read, T. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B: Methodological, 46(3), 440–464.
Crossley, S. A., Defore, C., Kyle, K., Dai, J., & McNamara, D. S. (2013). Paragraph specific n-gram approaches to automatically assessing essay quality. In S. K. D’Mello, R. A. Calvo, & A. Olney (Eds.), Proceedings of the 6th international conference on educational data mining (pp. 216–219). Heidelberg/Berlin: Springer.
Culpeper, J. (2009). Words, parts-of-speech and semantic categories in the character-talk of Shakespeare’s Romeo and Juliet. International Journal of Corpus Linguistics, 14(1), 29–59.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
Egbert, J., & Biber, D. (2019). Incorporating text dispersion into keyword analysis. Corpora, 14(1), 77–104.
Gries, S. T. (2005). Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory, 1(2), 277–294.
Hardie, A. (2014). Log Ratio – an informal introduction. CASS blog: http://cass.lancs.ac.uk/?p=1133. Accessed 25 June 2019.
Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. Bergen, Norway: The Norwegian Computing Centre for the Humanities.
Juilland, A., Brodin, D., & Davidovitch, C. (1970). Frequency dictionary of French words. Paris: Mouton & Co.
Kilgarriff, A. (1996). Why chi-square doesn’t work, and an improved LOB-Brown comparison. In Proceedings of the ALLC-ACH conference (pp. 169–172). Bergen, Norway.
Kilgarriff, A. (2005). Language is never ever ever random. Corpus Linguistics and Linguistic Theory, 1(2), 263–276.
Kyle, K., Crossley, S., Dai, J., & McNamara, D. (2013, June 13). Native language identification: A key N-gram category approach. In Proceedings of the eighth workshop on innovative use of NLP for building educational applications (pp. 242–250). Atlanta, Georgia.
Lijffijt, J., Nevalainen, T., Säily, T., Papapetrou, P., Puolamäki, K., & Mannila, H. (2016). Significance testing of word frequencies in corpora. Literary and Linguistic Computing, 31(2), 374–397.
Mahlberg, M. (2008). Clusters, key clusters and local textual functions in Dickens. Corpora, 2(1), 1–31.
Murphy, B. (2010). Corpus and sociolinguistics: Investigating age and gender in female talk. Amsterdam: John Benjamins.
Paquot, M. (2013). Lexical bundles and transfer effects. International Journal of Corpus Linguistics, 18(3), 391–417.
Paquot, M. (2014). Cross-linguistic influence and formulaic language: Recurrent word sequences in French learner writing. In L. Roberts, I. Vedder, & J. Hulstijn (Eds.), EUROSLA yearbook (pp. 216–237). Amsterdam: Benjamins.
Paquot, M. (2017). L1 frequency in foreign language acquisition: Recurrent word combinations in French and Spanish EFL learner writing. Second Language Research, 33(1), 13–32.
Paquot, M., & Bestgen, Y. (2009). Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. In A. Jucker, D. Schreier, & M. Hundt (Eds.), Corpora: Pragmatics and discourse (pp. 247–269). Amsterdam: Rodopi.
Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus Linguistics, 13(4), 519–549.
Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the workshop on Comparing Corpora, held in conjunction with the 38th annual meeting of the Association for Computational Linguistics (ACL 2000), 1–8 October 2000, Hong Kong (pp. 1–6).
Rayson, P., & Wilson, A. (1996). The ACAMRIT semantic tagging system: Progress report. In L. J. Evett & T. G. Rose (Eds.), Language engineering for document analysis and recognition, LEDAR, AISB96 workshop proceedings (pp. 13–20). Brighton: Faculty of Engineering and Computing, Nottingham Trent University, UK.
Rayson, P., Berridge, D., & Francis, B. (2004a, March 10–12). Extending the Cochran rule for the comparison of word frequencies between corpora. In G. Purnelle, C. Fairon, & A. Dister (Eds.), Le poids des mots: Proceedings of the 7th International Conference on Statistical analysis of textual data (JADT 2004) (Vol. II, pp. 926–936). Louvain-la-Neuve: Presses Universitaires de Louvain.
Rayson, P., Archer, D., Piao, S. L., & McEnery, T. (2004b). The UCREL semantic analysis system. In Proceedings of the workshop on beyond named entity recognition semantic labelling for NLP tasks in association with 4th international conference on language resources and evaluation (LREC 2004), 7–12. 25th May 2004, Lisbon, Portugal. Paris: European Language Resources Association.
Scott, M. (1997). PC analysis of key words – And key key words. System, 25(2), 233–245.
Scott, M. (2004). WordSmith tools. Version 4.0. Oxford: Oxford University Press. ISBN: 0-19-459400-9.
Seale, C., Ziebland, S., & Charteris-Black, J. (2006). Gender, cancer experience and internet use: A comparative keyword analysis of interviews and online cancer support groups. Social Science & Medicine, 62, 2577–2590.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Tono, Y., Yamazaki, M., & Maekawa, K. (2013). A frequency dictionary of Japanese. Routledge.
Vasishth, S., & Nicenboim, B. (2016). Statistical methods for linguistic research: Foundational ideas – Part I. Language and Linguistics Compass, 10, 349–369. https://doi.org/10.1111/lnc3.12201.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133.
Wilson, A. (2013). Embracing Bayes factors for key item analysis in corpus linguistics. In New approaches to the study of linguistic variability. Language competence and language awareness in Europe (pp. 3–11). Frankfurt: Peter Lang.
Wilson, A., & Rayson, P. (1993). Automatic content analysis of spoken discourse. In C. Souter & E. Atwell (Eds.), Corpus based computational linguistics (pp. 215–226). Amsterdam: Rodopi.

Author information

Authors and Affiliations

Lancaster University, Bailrigg, Lancaster, UK
Paul Rayson
Cardiff University, Cardiff, UK
Amanda Potts

Corresponding author

Correspondence to Paul Rayson.

Editor information

Editors and Affiliations

FNRS Centre for English Corpus Linguistics, Language and Communication Institute, UCLouvain, Louvain-la-Neuve, Belgium
Magali Paquot
Department of Linguistics, University of California, Santa Barbara, CA, USA
Stefan Th. Gries

About this chapter

Cite this chapter

Rayson, P., Potts, A. (2020). Analysing Keyword Lists. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_6

DOI: https://doi.org/10.1007/978-3-030-46216-1_6
Published: 05 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46215-4
Online ISBN: 978-3-030-46216-1
eBook Packages: Religion and Philosophy, Philosophy and Religion (R0)
