Writing evaluation: rater and task effects on the reliability of writing scores for children in Grades 3 and 4
We examined how raters and tasks influence measurement error in writing evaluation, and how many raters and tasks are needed to reach desirable reliability levels of .90 and .80 for children in Grades 3 and 4. A total of 211 children (102 boys) each completed three tasks in the narrative genre and three in the expository genre, and their written compositions were evaluated with methods widely used for developing writers: holistic scoring, productivity, and curriculum-based writing scores. Results showed that 54% and 52% of the variance in narrative and expository compositions, respectively, was attributable to true individual differences in writing. Students' scores varied substantially across tasks (30.44% and 28.61% of the variance, respectively) but not across raters. Reaching a reliability of .90 required multiple tasks and multiple raters, whereas a reliability of .80 required a single rater and multiple tasks. These findings have important implications for reliably evaluating children's writing skills, given that writing is typically evaluated with a single task and a single rater in classrooms, and even in some state accountability systems.
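To make the reliability projections concrete, the sketch below illustrates the kind of decision-study calculation used in generalizability theory for a fully crossed person × task × rater design. The variance components are hypothetical values loosely patterned on the abstract (about 54% person variance, sizable task-related variance, negligible rater variance), not the study's published estimates, and the function name `phi_coefficient` is ours:

```python
# A minimal D-study sketch for a fully crossed person x task x rater design.
# The variance components below are HYPOTHETICAL illustrations loosely
# patterned on the abstract; they are not the study's reported estimates.

def phi_coefficient(var, n_tasks, n_raters):
    """Absolute (Phi) reliability: person variance divided by person
    variance plus every error source, each averaged over the number of
    tasks/raters it spans."""
    abs_error = (
        var["task"] / n_tasks
        + var["rater"] / n_raters
        + var["person_x_task"] / n_tasks
        + var["person_x_rater"] / n_raters
        + var["task_x_rater"] / (n_tasks * n_raters)
        + var["residual"] / (n_tasks * n_raters)
    )
    return var["person"] / (var["person"] + abs_error)

# Hypothetical variance components as proportions of total score variance.
components = {
    "person": 0.54,          # true individual differences in writing
    "task": 0.20,            # task main effect (some prompts are harder)
    "rater": 0.01,           # rater main effect (negligible, per abstract)
    "person_x_task": 0.15,   # students' rank order shifts across tasks
    "person_x_rater": 0.02,
    "task_x_rater": 0.01,
    "residual": 0.07,        # three-way interaction confounded with error
}

# Project reliability for different numbers of tasks and raters.
for n_tasks in (1, 2, 3, 4):
    for n_raters in (1, 2):
        phi = phi_coefficient(components, n_tasks, n_raters)
        print(f"{n_tasks} task(s), {n_raters} rater(s): phi = {phi:.2f}")
```

Under these illustrative numbers, a single task and single rater yield a reliability near .54, while adding tasks raises reliability far faster than adding raters, mirroring the pattern the abstract reports.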
Keywords: Generalizability theory · Task effect · Rater effect · Assessment · Writing
Funding was provided by the National Institute of Child Health and Human Development (Grant No. P50HD052120). The authors wish to thank the participating schools, teachers, and students.