Abstract
This chapter illustrates how texts that have been part-of-speech (POS) tagged and lemmatized can be queried and analyzed with a combination of command-line tools and specialized lexical analysis programs to address a variety of analytical needs. We first discuss the generation and analysis of frequency lists that contain POS and lemma information, and then the analysis of n-gram lists with the same information. Finally, we briefly review measures commonly used in previous research to assess lexical density, variation, and sophistication, and introduce several tools that automate lexical richness analysis using these measures.
Notes
1. This is calculated as (n*1000)/N, where n denotes the raw frequency of the item in question and N denotes corpus size (i.e., total number of tokens in the corpus).
2. To exclude modal verbs, remove M from the pattern. To exclude wh-adverbs (tagged as WRB), enclose the part of the pattern after ^ in parentheses (i.e., /^([NVJ]|RB)/) so that tags that contain but do not start with RB are disqualified.
6. The current version can be found at <http://childes.psy.cmu.edu/manuals/clan.pdf>.
9. A pre-installed version of X11 is included in Mac OS X versions 10.5 through 10.7. For Mac OS X 10.8, X11 can be downloaded from <http://xquartz.macosforge.org/>.
12. A word family generally includes the base form of a word, its inflectional forms, and its commonly used derivational forms. For more information, see Bauer and Nation (1993).
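The per-thousand normalization in note 1 and the tag pattern in note 2 can be sketched in Python as follows. This is a minimal illustration, not the chapter's own commands: the function name `per_thousand`, the sample tag list, and the unparenthesized contrast pattern are assumptions introduced here, and the notes' awk-style pattern is assumed to carry over unchanged to Python's `re` syntax.

```python
import re


def per_thousand(n: int, corpus_size: int) -> float:
    """Note 1: (n * 1000) / N, where n is raw frequency and N is corpus size."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return (n * 1000) / corpus_size


# Note 2's parenthesized pattern: the ^ anchor applies to both alternatives,
# so a tag must START with N, V, J, or RB.
ANCHORED = re.compile(r"^([NVJ]|RB)")

# For contrast (an illustration, not a pattern quoted in the notes): without
# the parentheses, ^ binds only to [NVJ], so RB may match anywhere in the tag.
UNANCHORED = re.compile(r"^[NVJ]|RB")

# Hypothetical sample of Penn Treebank-style tags.
tags = ["NN", "VBD", "JJ", "RB", "RBR", "WRB", "MD", "DT"]

kept_anchored = [t for t in tags if ANCHORED.search(t)]
kept_unanchored = [t for t in tags if UNANCHORED.search(t)]

print(per_thousand(50, 200_000))  # 50 occurrences in a 200,000-token corpus
print(kept_anchored)
print(kept_unanchored)
```

With the sample tags, the parenthesized pattern drops WRB and MD, while the unparenthesized form still lets WRB through because RB matches inside the tag.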
References
Ai, H., and X. Lu. 2010. A web-based system for automatic measurement of lexical complexity. Paper presented at the 27th Annual Symposium of the Computer-Assisted Language Instruction Consortium. Amherst, MA.
Anthony, L. 2010. AntConc, Version 3.2.1. Tokyo: Waseda University. http://www.antlab.sci.waseda.ac.jp. Accessed 11 May 2013.
Bauer, L., and P. Nation. 1993. Word families. International Journal of Lexicography 6:253–279.
Besnier, N. 1988. The linguistic relationship of spoken and written Nukulaelae registers. Language 64:707–736.
Biber, D. 1988. Linguistic features: Algorithms and functions in variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. 2006. University language: A corpus-based study of spoken and written registers. Amsterdam: John Benjamins.
Biber, D., S. Conrad, and V. Cortes. 2004. If you look at …: Lexical bundles in university teaching and classrooms. Applied Linguistics 25:371–405.
Carroll, J. B. 1964. Language and thought. Englewood Cliffs: Prentice-Hall.
Cobb, T., and M. Horst. 2011. Does word coach coach words? CALICO Journal 28:639–661.
Covington, M. A., and J. D. McFall. 2010. Cutting the Gordian knot: The moving-average type-token ratio (MATTR). Journal of Quantitative Linguistics 17:94–100.
Coxhead, A. 2000. A new academic word list. TESOL Quarterly 34:213–238.
Engber, C. A. 1995. The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing 4:139–155.
Guiraud, P. 1960. Problèmes et méthodes de la statistique linguistique [Problems and methods of statistical linguistics]. Dordrecht: D. Reidel.
Halliday, M. A. K. 1985. Spoken and written language. Melbourne: Deakin University Press.
Heatley, A., I. S. P. Nation, and A. Coxhead. 2002. RANGE and FREQUENCY programs. Wellington: Victoria University of Wellington. http://www.victoria.ac.nz/lals/resources/range. Accessed 11 May 2013.
Herdan, G. 1964. Quantitative linguistics. London: Butterworths.
Hess, C. W., K. M. Sefton, and R. G. Landry. 1986. Sample size and type-token ratios for oral language of preschool children. Journal of Speech and Hearing Research 29:129–134.
Hyltenstam, K. 1988. Lexical characteristics of near-native second-language learners of Swedish. Journal of Multilingual and Multicultural Development 9:67–84.
Johnson, W. 1944. Studies in language behavior: I. A program of research. Psychological Monographs 56:1–15.
Kong, K. 2009. A comparison of the linguistic and interactional features of language learning websites and textbooks. Computer Assisted Language Learning 22:31–55.
Laufer, B. 1994. The lexical profile of second language writing: Does it change over time? RELC Journal 25:21–33.
Laufer, B., and P. Nation. 1995. Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics 16:307–322.
Linnarud, M. 1986. Lexis in composition: A performance analysis of Swedish learners’ written English. Lund: CWK Gleerup.
Lu, X. 2012. The relationship of lexical richness to the quality of ESL learners’ oral narratives. The Modern Language Journal 96:190–208.
Maas, H. D. 1972. Zusammenhang zwischen Wortschatzumfang und Länge eines Textes [Relationship between vocabulary size and text length]. Zeitschrift für Literaturwissenschaft und Linguistik [Journal of Literature and Linguistics] 8:73–79.
MacWhinney, B. 2000. The CHILDES project: Tools for analyzing talk. Mahwah: Erlbaum.
Malvern, D., B. Richards, N. Chipere, and P. Durán. 2004. Lexical diversity and language development: Quantification and assessment. Houndmills: Palgrave Macmillan.
Mann, M. B. 1944. Studies in language behavior: III. The quantitative differentiation of samples of written language. Psychological Monographs 56:41–74.
Manschreck, T. C., B. A. Maher, and D. N. Ader. 1981. Formal thought disorder, the type token ratio, and disturbed voluntary movement in schizophrenia. British Journal of Psychiatry 139:7–15.
McCarthy, P. 2005. An assessment of the range and usefulness of lexical diversity measures and the potential of the Measure of Textual, Lexical Diversity (MTLD). Unpublished doctoral dissertation, University of Memphis.
McCarthy, P. M., and S. Jarvis. 2007. A theoretical and empirical evaluation of vocd. Language Testing 24:459–488.
McCarthy, P. M., and S. Jarvis. 2010. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods 42:381–392.
McCarthy, P. M., S. Watanabe, and T. A. Lamkin. 2012. The gramulator: A tool to identify differential linguistic features of correlative text types. In Applied natural language processing and content analysis: Identification, investigation, and resolution, eds. P. M. McCarthy and C. Boonthum, 312–333. Hershey: IGI Global.
McKee, G., D. Malvern, and B. Richards. 2000. Measuring vocabulary diversity using dedicated software. Literary and Linguistic Computing 15:323–337.
Meara, P. 1978. Schizophrenic symptoms in foreign language learners. UEA Papers in Linguistics 7:22–49.
Minnen, G., J. Carroll, and D. Pearce. 2001. Applied morphological processing of English. Natural Language Engineering 7:207–223.
Nation, I. S. P. 1984. Vocabulary lists. Wellington: Victoria University of Wellington, English Language Institute.
O’Loughlin, K. 1995. Lexical density in candidate output of direct and semi-direct versions of an oral proficiency test. Language Testing 12:217–237.
Read, J. 2000. Assessing vocabulary. Oxford: Oxford University Press.
Richards, B. J., and D. D. Malvern. 1997. Quantifying lexical diversity in the study of language development: New Bulmershe papers. Reading: University of Reading.
Templin, M. 1957. Certain language skills in children: Their development and interrelationships. Minneapolis: The University of Minnesota Press.
Toutanova, K., D. Klein, C. Manning, and Y. Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of Human Language Technologies: The 2003 Conference of the North American Chapter of the Association for Computational Linguistics, 252–259. Stroudsburg: Association for Computational Linguistics.
Ure, J. 1971. Lexical density: A computational technique and some findings. In Talking about text, ed. M. Coulthard, 27–48. Birmingham: English Language Research, University of Birmingham.
West, M. 1953. A general service list of English words. London: Longman.
Wray, A. 2002. Formulaic language and the lexicon. Cambridge: Cambridge University Press.
Xue, G., and P. Nation. 1984. A university word list. Language Learning and Communication 3:215–229.
Yu, G. 2010. Lexical diversity in writing and speaking task performances. Applied Linguistics 31:236–259.
Copyright information
© 2014 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Lu, X. (2014). Lexical Analysis. In: Computational Methods for Corpus Annotation and Analysis. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-8645-4_4
DOI: https://doi.org/10.1007/978-94-017-8645-4_4
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-017-8644-7
Online ISBN: 978-94-017-8645-4
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)