Abstract
We examined how raters and tasks influence measurement error in writing evaluation and how many raters and tasks are needed to reach desired reliability levels of .90 and .80 for children in Grades 3 and 4. A total of 211 children (102 boys) were administered three tasks each in the narrative and expository genres, and their written compositions were evaluated with methods widely used for developing writers: holistic scoring, productivity, and curriculum-based writing scores. Results showed that 54 and 52% of the variance in narrative and expository compositions, respectively, was attributable to true individual differences in writing. Students’ scores varied substantially across tasks (30.44 and 28.61% of the variance, respectively) but varied little across raters. Reaching a reliability of .90 required multiple tasks and multiple raters, whereas a reliability of .80 required multiple tasks but only a single rater. These findings have important implications for evaluating children’s writing skills reliably, given that writing is typically evaluated with a single task and a single rater in classrooms and even in some state accountability systems.
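The design-level takeaway can be made concrete with a small decision-study calculation under generalizability theory. The sketch below uses hypothetical variance components loosely patterned on the percentages reported above (they are not the study’s actual estimates) to show how the generalizability coefficient for a fully crossed person × task × rater design increases as scores are averaged over more tasks and raters.

```python
# Minimal sketch of a decision (D) study for a fully crossed
# person x task x rater (p x t x r) design under generalizability theory.
# The variance components below are HYPOTHETICAL placeholders chosen to
# loosely echo the pattern in the abstract (~54% person variance, large
# task-related variance, negligible rater variance); they are not the
# study's estimates.

var_p = 0.54      # persons (true individual differences in writing)
var_pt = 0.30     # person x task interaction
var_pr = 0.01     # person x rater interaction
var_ptr_e = 0.15  # p x t x r interaction confounded with residual error

def g_coefficient(n_tasks: int, n_raters: int) -> float:
    """Relative generalizability coefficient when averaging scores
    over n_tasks tasks and n_raters raters."""
    relative_error = (var_pt / n_tasks
                      + var_pr / n_raters
                      + var_ptr_e / (n_tasks * n_raters))
    return var_p / (var_p + relative_error)

# How many tasks and raters are needed to reach .80 or .90?
for n_t in range(1, 7):
    for n_r in (1, 2, 3):
        print(f"tasks={n_t}, raters={n_r}: "
              f"E-rho^2 = {g_coefficient(n_t, n_r):.3f}")
```

With these placeholder components, a single task and single rater yields a coefficient of only .54; adding tasks raises reliability far faster than adding raters, mirroring the abstract’s conclusion that tasks, not raters, dominate measurement error.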
Notes
Children were given 15 min based on our experience with elementary-grade children. CBM writing assessments typically allow shorter times (e.g., 3 min). This does not raise a validity issue in the present study because our purpose was to examine the reliability of various evaluation approaches, including CBM writing indicators, not of a particular CBM writing test (e.g., a picture task) per se.
Facets are measurement features or sources of variation such as person, rater, and task.
Acknowledgements
Funding was provided by the National Institute of Child Health and Human Development (Grant No. P50HD052120). The authors thank the participating schools, teachers, and students.
Cite this article
Kim, Y.-S. G., Schatschneider, C., Wanzek, J., et al. (2017). Writing evaluation: Rater and task effects on the reliability of writing scores for children in Grades 3 and 4. Reading and Writing, 30, 1287–1310. https://doi.org/10.1007/s11145-017-9724-6