Abstract
Frequency lists are useful in their own right for helping a linguist, lexicographer, language teacher, or learner to analyse or exploit a corpus. When frequency lists are compared through the keywords approach, significant changes in the relative ordering of words can flag points of interest. This conceptually simple approach of comparing one frequency list against another has been very widely exploited in corpus linguistics to help answer a vast number of research questions. In this chapter, we describe step by step how to produce a keywords list, and then highlight two representative studies to illustrate the usefulness of the method. In our critical assessment of the keywords method, we highlight issues related to corpus design and comparability and to the application of statistics, and we discuss the use of clusters and n-grams to improve the method. We also describe important software tools and other resources, and provide further reading.
Notes
1. Online calculators and downloadable spreadsheets are available at http://corpora.lancs.ac.uk/sigtest/ (accessed 25 June 2019) and http://ucrel.lancs.ac.uk/llwizard.html (accessed 25 June 2019).
2. It should be noted that this formula represents the 2-cell calculation (Rayson and Garside 2000) which can be used since the contribution from the other two cells is fairly constant and does not affect the ranking order. Other tools, e.g. AntConc, and statistical calculators also support the 4-cell calculation incorporating contributions from frequencies of the other words into the Log-Likelihood value.
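The difference between the 2-cell and 4-cell calculations described in the second note can be illustrated with a minimal Python sketch. It assumes the usual contingency-table form of the log-likelihood statistic, with expected values derived from the two corpus sizes. The function name `log_likelihood`, the `four_cell` flag, and the example frequencies are illustrative assumptions, not taken from the chapter or from any particular tool.

```python
import math


def _term(observed, expected):
    # One cell's contribution, O * ln(O / E); taken as 0 when O is 0.
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def log_likelihood(a, b, c, d, four_cell=False):
    """Log-likelihood for one word across two corpora.

    a, b: frequency of the word in corpus 1 and corpus 2
    c, d: total number of tokens in corpus 1 and corpus 2
    four_cell: if True, also add the cells for all the other words
    """
    total = c + d
    # Expected frequencies of the word if it were spread evenly by corpus size.
    e1 = c * (a + b) / total
    e2 = d * (a + b) / total
    ll = _term(a, e1) + _term(b, e2)
    if four_cell:
        # Remaining two cells: every token that is NOT this word.
        rest = total - a - b
        e3 = c * rest / total
        e4 = d * rest / total
        ll += _term(c - a, e3) + _term(d - b, e4)
    return 2 * ll


# Hypothetical frequencies in two corpora of 10,000 tokens each.
ll2 = log_likelihood(100, 50, 10_000, 10_000)
ll4 = log_likelihood(100, 50, 10_000, 10_000, four_cell=True)
print(f"2-cell: {ll2:.2f}  4-cell: {ll4:.2f}")
```

When the word is rare relative to the corpus sizes, the two extra cells add only a small amount, which is consistent with the note's remark that they do not affect the ranking order.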
References
Baker, P. (2004). Querying keywords: Questions of difference, frequency, and sense in keywords analysis. Journal of English Linguistics, 32(4).
Baker, P. (2017). British and American English: Divided by a common language? Cambridge: Cambridge University Press.
Baker, P., Gabrielatos, C., & McEnery, T. (2013). Discourse analysis and media attitudes: The representation of Islam in the British Press. Cambridge: Cambridge University Press.
Baron, A., Rayson, P., & Archer, D. (2009). Word frequency and key word statistics in corpus linguistics. Anglistik, 20(1), 41–67.
Boneva, B., & Kraut, R. (2002). Email, gender, and personal relations. In B. Wellman & C. Haythornthwaite (Eds.), The internet in everyday life (pp. 372–403). Oxford: Blackwell.
Brezina, V., & Meyerhoff, M. (2014). Significant or random? A critical review of sociolinguistic generalisations based on large corpora. International Journal of Corpus Linguistics, 19(1), 1–28.
Cressie, N., & Read, T. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B: Methodological, 46(3), 440–464.
Crossley, S. A., Defore, C., Kyle, K., Dai, J., & McNamara, D. S. (2013). Paragraph specific n-gram approaches to automatically assessing essay quality. In S. K. D’Mello, R. A. Calvo, & A. Olney (Eds.), Proceedings of the 6th international conference on educational data mining (pp. 216–219). Heidelberg/Berlin: Springer.
Culpeper, J. (2009). Words, parts-of-speech and semantic categories in the character-talk of Shakespeare’s Romeo and Juliet. International Journal of Corpus Linguistics, 14(1), 29–59.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
Egbert, J., & Biber, D. (2019). Incorporating text dispersion into keyword analysis. Corpora, 14(1), 77–104.
Gries, S. T. (2005). Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. Corpus Linguistics and Linguistic Theory, 1(2), 277–294.
Hardie, A. (2014). Log Ratio – an informal introduction. CASS blog: http://cass.lancs.ac.uk/?p=1133. Accessed 25 June 2019.
Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. Bergen, Norway: The Norwegian Computing Centre for the Humanities.
Juilland, A., Brodin, D., & Davidovitch, C. (1970). Frequency dictionary of French words. Paris: Mouton.
Kilgarriff, A. (1996). Why chi-square doesn’t work, and an improved LOB-Brown comparison. In Proceedings of the ALLC-ACH conference (pp. 169–172). Bergen, Norway.
Kilgarriff, A. (2005). Language is never ever ever random. Corpus Linguistics and Linguistic Theory, 1(2), 263–276.
Kyle, K., Crossley, S., Dai, J., & McNamara, D. (2013, June 13). Native language identification: A key N-gram category approach. In Proceedings of the eighth workshop on innovative use of NLP for building educational applications (pp. 242–250). Atlanta, Georgia.
Lijffijt, J., Nevalainen, T., Säily, T., Papapetrou, P., Puolamäki, K., & Mannila, H. (2016). Significance testing of word frequencies in corpora. Literary and Linguistic Computing, 31(2), 374–397.
Mahlberg, M. (2008). Clusters, key clusters and local textual functions in Dickens. Corpora, 2(1), 1–31.
Murphy, B. (2010). Corpus and sociolinguistics: Investigating age and gender in female talk. Amsterdam: John Benjamins.
Paquot, M. (2013). Lexical bundles and transfer effects. International Journal of Corpus Linguistics, 18(3), 391–417.
Paquot, M. (2014). Cross-linguistic influence and formulaic language: Recurrent word sequences in French learner writing. In L. Roberts, I. Vedder, & J. Hulstijn (Eds.), EUROSLA yearbook (pp. 216–237). Amsterdam: Benjamins.
Paquot, M. (2017). L1 frequency in foreign language acquisition: Recurrent word combinations in French and Spanish EFL learner writing. Second Language Research, 33(1), 13–32.
Paquot, M., & Bestgen, Y. (2009). Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. In A. Jucker, D. Schreier, & M. Hundt (Eds.), Corpora: Pragmatics and discourse (pp. 247–269). Amsterdam: Rodopi.
Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus Linguistics, 13(4), 519–549.
Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the workshop on Comparing Corpora, held in conjunction with the 38th annual meeting of the Association for Computational Linguistics (ACL 2000), 1–8 October 2000, Hong Kong (pp. 1–6).
Rayson, P., & Wilson, A. (1996). The ACAMRIT semantic tagging system: Progress report. In L. J. Evett & T. G. Rose (Eds.), Language engineering for document analysis and recognition, LEDAR, AISB96 workshop proceedings (pp. 13–20). Brighton: Faculty of Engineering and Computing, Nottingham Trent University, UK.
Rayson, P., Berridge, D., & Francis, B. (2004a, March 10–12). Extending the Cochran rule for the comparison of word frequencies between corpora. In G. Purnelle, C. Fairon, & A. Dister (Eds.), Le poids des mots: Proceedings of the 7th International Conference on Statistical analysis of textual data (JADT 2004) (Vol. II, pp. 926–936). Louvain-la-Neuve: Presses Universitaires de Louvain.
Rayson, P., Archer, D., Piao, S. L., & McEnery, T. (2004b). The UCREL semantic analysis system. In Proceedings of the workshop on Beyond Named Entity Recognition: Semantic Labelling for NLP Tasks, in association with the 4th International Conference on Language Resources and Evaluation (LREC 2004), 25 May 2004, Lisbon, Portugal (pp. 7–12). Paris: European Language Resources Association.
Scott, M. (1997). PC analysis of key words – And key key words. System, 25(2), 233–245.
Scott, M. (2004). WordSmith tools. Version 4.0. Oxford: Oxford University Press. ISBN: 0-19-459400-9.
Seale, C., Ziebland, S., & Charteris-Black, J. (2006). Gender, cancer experience and internet use: A comparative keyword analysis of interviews and online cancer support groups. Social Science & Medicine, 62, 2577–2590.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Tono, Y., Yamazaki, M., & Maekawa, K. (2013). A frequency dictionary of Japanese. Routledge.
Vasishth, S., & Nicenboim, B. (2016). Statistical methods for linguistic research: Foundational ideas – Part I. Language and Linguistics Compass, 10, 349–369. https://doi.org/10.1111/lnc3.12201.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133.
Wilson, A. (2013). Embracing Bayes factors for key item analysis in corpus linguistics. In New approaches to the study of linguistic variability. Language competence and language awareness in Europe (pp. 3–11). Frankfurt: Peter Lang.
Wilson, A., & Rayson, P. (1993). Automatic content analysis of spoken discourse. In C. Souter & E. Atwell (Eds.), Corpus based computational linguistics (pp. 215–226). Amsterdam: Rodopi.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Rayson, P., Potts, A. (2020). Analysing Keyword Lists. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_6
DOI: https://doi.org/10.1007/978-3-030-46216-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46215-4
Online ISBN: 978-3-030-46216-1
eBook Packages: Religion and Philosophy, Philosophy and Religion (R0)