Abstract

This chapter illustrates how texts that have been part-of-speech (POS) tagged and lemmatized can be queried and analyzed using a combination of command line interface tools and specialized programs for lexical analysis to address a variety of analytical needs. We first discuss the generation and analysis of frequency lists that contain POS and lemma information. We then discuss the analysis of n-gram lists with POS and lemma information. Finally, we briefly review some of the measures commonly used in previous research to assess lexical density, variation, and sophistication, and introduce a number of tools that automate lexical richness analysis using these measures.

Notes

  1. This is calculated as (n*1000)/N, where n denotes the raw frequency of the item in question and N denotes corpus size (i.e., total number of tokens in the corpus).

  2. To exclude modal verbs, remove M from the pattern. To exclude wh-adverbs (tagged as WRB), enclose the part of the pattern after ^ in parentheses (i.e., /^([NVJ]|RB)/) so that tags that contain but do not start with RB are disqualified.

  3. <http://aihaiyang.com/synlex/lexical/>.

  4. <http://www.personal.psu.edu/xxl13/downloads/lca.html>.

  5. <http://childes.psy.cmu.edu>.

  6. The current version can be found at <http://childes.psy.cmu.edu/manuals/clan.pdf>.

  7. <http://www.ai.uga.edu/caspr/>.

  8. <http://www.mono-project.com>.

  9. A pre-installed version of X11 is included in Mac OS X versions 10.5 through 10.7. For Mac OS X 10.8, X11 can be downloaded from <http://xquartz.macosforge.org/>.

  10. <https://umdrive.memphis.edu/pmmccrth/public/software/software_index.htm>.

  11. <http://www.victoria.ac.nz/lals/resources/range>.

  12. A word family generally includes the base form of a word, its inflectional forms, and its commonly used derivational forms. For more information, see Bauer and Nation (1993).

  13. <http://www.lextutor.ca/>.
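
The calculation in note 1 and the pattern change in note 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from the chapter: the function name per_thousand, the sample tag list, and the unparenthesized pattern ^[NVJ]|RB (a guess at the form the pattern takes once M has been removed) are all hypothetical.

```python
import re


def per_thousand(raw_freq, corpus_size):
    """Normalized frequency per 1,000 tokens: (n * 1000) / N (note 1)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_freq * 1000 / corpus_size


# Assumed unparenthesized form: ^ applies only to [NVJ], so RB may match
# anywhere in the tag, and WRB slips through.
LOOSE = re.compile(r"^[NVJ]|RB")

# Parenthesized form from note 2: ^ applies to both alternatives, so only
# tags that start with N, V, J, or RB are accepted.
STRICT = re.compile(r"^([NVJ]|RB)")

# Hypothetical sample of Penn Treebank style tags.
tags = ["NN", "VBD", "JJ", "RB", "RBR", "WRB", "MD", "DT"]

loose_hits = [t for t in tags if LOOSE.search(t)]
strict_hits = [t for t in tags if STRICT.search(t)]

print(per_thousand(25, 50000))
print(loose_hits)
print(strict_hits)
```

With the parenthesized pattern, WRB is excluded because the anchor ^ now governs the RB alternative as well; the adverb tags RB and RBR are still retained.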

References

  • Ai, H., and X. Lu. 2010. A web-based system for automatic measurement of lexical complexity. Paper presented at the 27th Annual Symposium of the Computer-Assisted Language Instruction Consortium. Amherst, MA.

  • Anthony, L. 2010. AntConc, Version 3.2.1. Tokyo: Waseda University. http://www.antlab.sci.waseda.ac.jp. Accessed 11 May 2013.

  • Bauer, L., and P. Nation. 1993. Word families. International Journal of Lexicography 6:253–279.

  • Besnier, N. 1988. The linguistic relationship of spoken and written Nukulaelae registers. Language 64:707–736.

  • Biber, D. 1988. Linguistic features: Algorithms and functions in variation across speech and writing. Cambridge: Cambridge University Press.

  • Biber, D. 2006. University language: A corpus-based study of spoken and written registers. Amsterdam: John Benjamins.

  • Biber, D., S. Conrad, and V. Cortes. 2004. If you look at …: Lexical bundles in university teaching and classrooms. Applied Linguistics 25:371–405.

  • Carroll, J. B. 1964. Language and thought. Englewood Cliffs: Prentice-Hall.

  • Cobb, T., and M. Horst. 2011. Does Word Coach coach words? CALICO Journal 28:639–661.

  • Covington, M. A., and J. D. McFall. 2010. Cutting the Gordian knot: The moving-average type-token ratio (MATTR). Journal of Quantitative Linguistics 17:94–100.

  • Coxhead, A. 2000. A new academic word list. TESOL Quarterly 34:213–238.

  • Engber, C. A. 1995. The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing 4:139–155.

  • Guiraud, P. 1960. Problèmes et méthodes de la statistique linguistique [Problems and methods of statistical linguistics]. Dordrecht: D. Reidel.

  • Halliday, M. A. K. 1985. Spoken and written language. Melbourne: Deakin University Press.

  • Heatley, A., I. S. P. Nation, and A. Coxhead. 2002. RANGE and FREQUENCY programs. Wellington: Victoria University of Wellington. http://www.victoria.ac.nz/lals/resources/range. Accessed 11 May 2013.

  • Herdan, G. 1964. Quantitative linguistics. London: Butterworths.

  • Hess, C. W., K. M. Sefton, and R. G. Landry. 1986. Sample size and type-token ratios for oral language of preschool children. Journal of Speech and Hearing Research 29:129–134.

  • Hyltenstam, K. 1988. Lexical characteristics of near-native second-language learners of Swedish. Journal of Multilingual and Multicultural Development 9:67–84.

  • Johnson, W. 1944. Studies in language behavior: I. A program of research. Psychological Monographs 56:1–15.

  • Kong, K. 2009. A comparison of the linguistic and interactional features of language learning websites and textbooks. Computer Assisted Language Learning 22:31–55.

  • Laufer, B. 1994. The lexical profile of second language writing: Does it change over time? RELC Journal 25:21–33.

  • Laufer, B., and P. Nation. 1995. Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics 16:307–322.

  • Linnarud, M. 1986. Lexis in composition: A performance analysis of Swedish learners’ written English. Lund: CWK Gleerup.

  • Lu, X. 2012. The relationship of lexical richness to the quality of ESL learners’ oral narratives. The Modern Language Journal 96:190–208.

  • Maas, H. D. 1972. Zusammenhang zwischen Wortschatzumfang und Länge eines Textes [Relationship between vocabulary size and text length]. Zeitschrift für Literaturwissenschaft und Linguistik [Journal of Literature and Linguistics] 8:73–79.

  • MacWhinney, B. 2000. The CHILDES project: Tools for analyzing talk. Mahwah: Erlbaum.

  • Malvern, D., B. Richards, N. Chipere, and P. Durán. 2004. Lexical diversity and language development: Quantification and assessment. Houndmills: Palgrave Macmillan.

  • Mann, M. B. 1944. Studies in language behavior: III. The quantitative differentiation of samples of written language. Psychological Monographs 56:41–74.

  • Manschreck, T. C., B. A. Maher, and D. N. Ader. 1981. Formal thought disorder, the type token ratio, and disturbed voluntary movement in schizophrenia. British Journal of Psychiatry 139:7–15.

  • McCarthy, P. 2005. An assessment of the range and usefulness of lexical diversity measures and the potential of the Measure of Textual, Lexical Diversity (MTLD). Unpublished doctoral dissertation, University of Memphis.

  • McCarthy, P. M., and S. Jarvis. 2007. A theoretical and empirical evaluation of vocd. Language Testing 24:459–488.

  • McCarthy, P. M., and S. Jarvis. 2010. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods 42:381–392.

  • McCarthy, P. M., S. Watanabe, and T. A. Lamkin. 2012. The gramulator: A tool to identify differential linguistic features of correlative text types. In Applied natural language processing and content analysis: Identification, investigation, and resolution, eds. P. M. McCarthy and C. Boonthum, 312–333. Hershey: IGI Global.

  • McKee, G., D. Malvern, and B. Richards. 2000. Measuring vocabulary diversity using dedicated software. Literary and Linguistic Computing 15:323–337.

  • Meara, P. 1978. Schizophrenic symptoms in foreign language learners. UEA Papers in Linguistics 7:22–49.

  • Minnen, G., J. Carroll, and D. Pearce. 2001. Applied morphological processing of English. Natural Language Engineering 7:207–223.

  • Nation, I. S. P. 1984. Vocabulary lists. Wellington: Victoria University of Wellington, English Language Institute.

  • O’Loughlin, K. 1995. Lexical density in candidate output of direct and semi-direct versions of an oral proficiency test. Language Testing 12:217–237.

  • Read, J. 2000. Assessing vocabulary. Oxford: Oxford University Press.

  • Richards, B. J., and D. D. Malvern. 1997. Quantifying lexical diversity in the study of language development: New Bulmershe papers. Reading: University of Reading.

  • Templin, M. 1957. Certain language skills in children: Their development and interrelationships. Minneapolis: The University of Minnesota Press.

  • Toutanova, K., D. Klein, C. Manning, and Y. Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of Human Language Technologies: The 2003 Conference of the North American Chapter of the Association for Computational Linguistics, 252–259. Stroudsburg: Association for Computational Linguistics.

  • Ure, J. 1971. Lexical density: A computational technique and some findings. In Talking about text, ed. M. Coulthard, 27–48. Birmingham: English Language Research, University of Birmingham.

  • West, M. 1953. A general service list of English words. London: Longman.

  • Wray, A. 2002. Formulaic language and the lexicon. Cambridge: Cambridge University Press.

  • Xue, G., and P. Nation. 1984. A university word list. Language Learning and Communication 3:215–229.

  • Yu, G. 2010. Lexical diversity in writing and speaking task performances. Applied Linguistics 31:236–259.

Corresponding author

Correspondence to Xiaofei Lu.

Copyright information

© 2014 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Lu, X. (2014). Lexical Analysis. In: Computational Methods for Corpus Annotation and Analysis. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-8645-4_4
