Focused Information Retrieval & English Language Instruction: A New Text Complexity Algorithm for Automatic Text Classification

  • Conference paper
Mining Intelligence and Knowledge Exploration

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8891)

Abstract

The present study aimed to delineate a range of linguistic features that characterize the English reading texts used at the B2 (Independent User) and C1 (Advanced User) levels of the Greek State Certificate of English Language Proficiency (KPG) exams, in order to better define text complexity per level of competence. The main outcome of the research was the L.A.S.T. Text Difficulty Index, which makes possible the automatic classification of B2 and C1 English reading texts based on four in-depth linguistic features: lexical density, syntactic structure similarity, tokens per word family, and academic vocabulary. Given that the predictive accuracy of the formula reached 80% on a new set of reading comprehension texts, with 32 out of the 40 new texts assigned to similar levels by both raters, the practical usefulness of the index might extend to EFL testers and materials writers, who are in constant need of calibrated texts.
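To make one of the four features concrete, the following is a minimal sketch of a lexical density calculation, taken here in its common sense as the share of content words among all tokens. It also reproduces the accuracy arithmetic from the abstract (32 of 40 texts). The function-word list, the tokenizer, and the function names are assumptions made for illustration; this is not the L.A.S.T. formula, whose weights and feature definitions are not given in the abstract.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration only; the paper's actual word classes are not stated here).
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "at", "to",
    "for", "with", "by", "is", "are", "was", "were", "be", "been", "it",
    "this", "that", "he", "she", "they", "we", "you", "i", "not",
}


def lexical_density(text: str) -> float:
    """Return the share of tokens not found in FUNCTION_WORDS.

    Returns 0.0 for text that contains no alphabetic tokens.
    """
    # Crude tokenizer (assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    content_words = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content_words) / len(tokens)


def agreement_rate(matches: int, total: int) -> float:
    """Return the proportion of texts on which two classifications agree."""
    if total <= 0:
        raise ValueError("total must be positive")
    return matches / total


if __name__ == "__main__":
    print(lexical_density("The cat sat on the mat"))
    print(agreement_rate(32, 40))
```

A real implementation would more likely rely on part-of-speech tagging than on a fixed word list, since a word list cannot separate content and function uses of the same form.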


© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Liontou, T. (2014). Focused Information Retrieval & English Language Instruction: A New Text Complexity Algorithm for Automatic Text Classification. In: Prasath, R., O’Reilly, P., Kathirvalavakumar, T. (eds) Mining Intelligence and Knowledge Exploration. Lecture Notes in Computer Science, vol 8891. Springer, Cham. https://doi.org/10.1007/978-3-319-13817-6_13

  • DOI: https://doi.org/10.1007/978-3-319-13817-6_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13816-9

  • Online ISBN: 978-3-319-13817-6

  • eBook Packages: Computer Science, Computer Science (R0)
