Reading and Writing, Volume 30, Issue 6, pp 1287–1310

Writing evaluation: rater and task effects on the reliability of writing scores for children in Grades 3 and 4

  • Young-Suk Grace Kim
  • Christopher Schatschneider
  • Jeanne Wanzek
  • Brandy Gatlin
  • Stephanie Al Otaiba


We examined how raters and tasks influence measurement error in writing evaluation, and how many raters and tasks are needed to reach desirable reliability levels of .90 and .80 for children in Grades 3 and 4. A total of 211 children (102 boys) were administered three tasks each in the narrative and expository genres, and their written compositions were evaluated using methods widely applied to developing writers: holistic scoring, productivity, and curriculum-based writing scores. Results showed that 54% and 52% of the variance in narrative and expository compositions, respectively, was attributable to true individual differences in writing. Students’ scores varied largely by task (30.44% and 28.61% of variance), but not by rater. To reach a reliability of .90, multiple tasks and raters were needed; for a reliability of .80, a single rater and multiple tasks were needed. These findings have important implications for reliably evaluating children’s writing skills, given that writing is typically evaluated with a single task and a single rater in classrooms and even in some state accountability systems.
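The question of how many tasks and raters are needed for a target reliability can be illustrated with a small decision-study calculation. The sketch below assumes a fully crossed person × task × rater design and uses the standard relative generalizability coefficient, in which each error component is divided by the number of tasks or raters averaged over. The abstract does not report the design or the full set of variance components, so the function names and the component values here are hypothetical and chosen only for illustration; they are not the study's estimates.

```python
from itertools import product


def g_coefficient(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_raters):
    """Relative generalizability coefficient for a crossed p x t x r design.

    var_p      : person (true individual difference) variance
    var_pt     : person-by-task interaction variance
    var_pr     : person-by-rater interaction variance
    var_ptr_e  : three-way interaction confounded with residual error
    """
    relative_error = (
        var_pt / n_tasks
        + var_pr / n_raters
        + var_ptr_e / (n_tasks * n_raters)
    )
    return var_p / (var_p + relative_error)


def smallest_design(components, target, max_tasks=10, max_raters=5):
    """Return the (n_tasks, n_raters) pair with the fewest total ratings
    that meets the target coefficient, or None if no pair in range does."""
    candidates = [
        (n_t, n_r)
        for n_t, n_r in product(range(1, max_tasks + 1), range(1, max_raters + 1))
        if g_coefficient(**components, n_tasks=n_t, n_raters=n_r) >= target
    ]
    # Prefer fewer total ratings, then fewer raters, then fewer tasks.
    return min(candidates, key=lambda d: (d[0] * d[1], d[1], d[0]), default=None)


# Hypothetical variance components (proportions of total variance).
# These are illustrative assumptions, NOT values reported in the article.
example_components = dict(var_p=0.50, var_pt=0.30, var_pr=0.02, var_ptr_e=0.18)

if __name__ == "__main__":
    for target in (0.80, 0.90):
        design = smallest_design(example_components, target)
        print(f"target {target:.2f}: (tasks, raters) = {design}")
```

With components of this shape, where task-related variance is large and rater-related variance is small, adding tasks raises the coefficient much faster than adding raters, which is the pattern the abstract describes.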


Keywords: Generalizability theory; Task effect; Rater effect; Assessment; Writing



Funding was provided by the National Institute of Child Health and Human Development (Grant No. P50HD052120). The authors wish to thank the participating schools, teachers, and students.



Copyright information

© Springer Science+Business Media Dordrecht 2017

Authors and Affiliations

  • Young-Suk Grace Kim (1)
  • Christopher Schatschneider (2)
  • Jeanne Wanzek (3)
  • Brandy Gatlin (4)
  • Stephanie Al Otaiba (5)

  1. University of California, Irvine, Irvine, USA
  2. Florida Center for Reading Research, Florida State University, Tallahassee, USA
  3. Vanderbilt University, Nashville, USA
  4. Georgia State University, Atlanta, USA
  5. Southern Methodist University, Dallas, USA
