Coh-Metrix: Analysis of text on cohesion and language

Abstract

Advances in computational linguistics and discourse processing have made it possible to automate many language- and text-processing mechanisms. We have developed a computer tool called Coh-Metrix, which analyzes texts on over 200 measures of cohesion, language, and readability. Its modules use lexicons, part-of-speech classifiers, syntactic parsers, templates, corpora, latent semantic analysis, and other components that are widely used in computational linguistics. After the user enters an English text, Coh-Metrix returns measures requested by the user. In addition, a facility allows the user to store the results of these analyses in data files (such as Text, Excel, and SPSS). Standard text readability formulas scale texts on difficulty by relying on word length and sentence length, whereas Coh-Metrix is sensitive to cohesion relations, world knowledge, and language and discourse characteristics.
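
The contrast with standard readability formulas can be made concrete with a small sketch. The abstract does not name a particular formula, so the example below assumes the Flesch Reading Ease score as a representative one; the function names and the vowel-run syllable counter are illustrative assumptions, and none of this is Coh-Metrix code. Because the score depends only on words per sentence and syllables per word, reordering the sentences of a text leaves it unchanged, even though the reordering may disrupt cohesion.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count runs of vowel letters."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Same sentences in two orders: the formula cannot tell them apart.
ordered = "The rain fell all night. So the river rose. Then the town flooded."
shuffled = "Then the town flooded. The rain fell all night. So the river rose."
same_score = flesch_reading_ease(ordered) == flesch_reading_ease(shuffled)
```

Here `same_score` is true because both versions contain the same counts of sentences, words, and syllables. Measures that are sensitive to cohesion relations, as the abstract describes for Coh-Metrix, are meant to capture the kind of difference this formula misses.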

Author information

Correspondence to Arthur C. Graesser or Danielle S. McNamara.

Additional information

The research was supported by Institute for Education Sciences Grant IES R3056020018-02 and National Science Foundation Grant SES 9977969. Any opinions, findings, and conclusions or recommendations expressed in this article are those of the authors and do not necessarily reflect the views of the IES or the NSF.

About this article

Cite this article

Graesser, A.C., McNamara, D.S., Louwerse, M.M. et al. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers 36, 193–202 (2004). https://doi.org/10.3758/BF03195564

Keywords

  • Latent Semantic Analysis
  • Content Word
  • World Knowledge
  • Discourse Process
  • Sentence Pair