Abstract
Authentic assessments used in response to accountability demands in higher education face at least two threats to validity. First, a lack of interchangeability between assessment tasks introduces bias when aggregate-based scores are used at an institutional level. Second, reliance on written products to capture constructs such as critical thinking (CT) may introduce construct-irrelevant variance if score variance reflects written communication (WC) skill as well as variation in the construct of interest. Two studies investigated these threats to validity. Student written responses to faculty in-class assignments were sampled from general education courses within a single institution. Faculty raters trained to use a common rubric then rated the students' written papers. The first study used hierarchical linear modeling to estimate the magnitude of between-assignment variance in CT scores among 343 student papers nested within 18 assignments. About 18% of the total CT variance was attributable to differences in average CT scores across assignments, indicating that assignments were not interchangeable. Approximately 47% of this between-assignment variance was predicted by the extent to which an assignment asked students to demonstrate their own perspective. Aggregating CT scores across students and assignments could therefore bias the scores upward or downward depending on the characteristics of the assignments, particularly perspective-taking. The second study used exploratory factor analysis and squared partial correlations to estimate the magnitude of construct-irrelevant variance in CT scores. Student papers were rated for CT by one group of faculty and for WC by a different group of faculty. Nearly 25% of the variance in CT scores was attributable to differences in WC scores. Score-based interpretations of CT may therefore need to be delimited when observations are obtained solely through written products. Both studies imply a need to gather additional validity evidence for authentic assessment practices before this strategy is widely adopted among institutions of higher education. The authors also address misconceptions about standardization in authentic assessment practices.
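To make the reported percentages concrete, the following is a minimal sketch of the variance decomposition implied by a two-level model of papers (i) nested within assignments (j); the notation is generic hierarchical-linear-model notation added here for illustration, not the authors' own equations.

\[ \mathrm{CT}_{ij} = \gamma_{00} + u_{0j} + r_{ij}, \qquad u_{0j} \sim N(0, \tau_{00}), \quad r_{ij} \sim N(0, \sigma^{2}) \]

\[ \rho = \frac{\tau_{00}}{\tau_{00} + \sigma^{2}} \approx 0.18, \qquad R^{2}_{\mathrm{between}} = \frac{\tau_{00}^{\mathrm{uncond}} - \tau_{00}^{\mathrm{cond}}}{\tau_{00}^{\mathrm{uncond}}} \approx 0.47 \]

Here \(\rho\) is the intraclass correlation, the share of total CT variance lying between assignments, and \(R^{2}_{\mathrm{between}}\) is the proportional reduction in between-assignment variance after adding the perspective-taking characteristic as a level-2 predictor.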
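A brief, runnable sketch of how such estimates can be computed follows; the simulated dataset, column names (ct, wc, assignment), and effect sizes are illustrative assumptions, not the authors' data or analysis code.

# Illustrative sketch only: simulated data, not the authors' dataset or code.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2016)

# Roughly the Study 1 structure: papers nested within 18 assignments.
n_assign, n_per = 18, 19
between = rng.normal(0.0, 0.30, n_assign)          # assignment-level effects
df = pd.DataFrame({"assignment": np.repeat(np.arange(n_assign), n_per)})
df["ct"] = between[df["assignment"]] + rng.normal(0.0, 0.64, len(df))
df["wc"] = 0.5 * df["ct"] + rng.normal(0.0, 1.0, len(df))  # CT-WC overlap

# Study 1 analogue: unconditional two-level model of CT scores.
m0 = smf.mixedlm("ct ~ 1", df, groups=df["assignment"]).fit(reml=True)
tau00 = float(m0.cov_re.iloc[0, 0])   # between-assignment variance
sigma2 = float(m0.scale)              # within-assignment (residual) variance
print(f"ICC = {tau00 / (tau00 + sigma2):.2f}")  # share of CT variance between assignments

# Study 2 analogue: shared variance between CT and WC ratings. The article
# used squared partial correlations; a squared zero-order correlation is
# shown here as a simplified stand-in.
r = df["ct"].corr(df["wc"])
print(f"r^2(CT, WC) = {r**2:.2f}")

With the simulated variance components (0.30 squared between, 0.64 squared within), the expected intraclass correlation is about 0.09 / 0.50 = 0.18, mirroring the figure reported above; sample estimates will vary with the seed.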
Cite this article
Hathcoat, J.D., Penn, J.D., Barnes, L.L.B. et al. A Second Dystopia in Education: Validity Issues in Authentic Assessment Practices. Res High Educ 57, 892–912 (2016). https://doi.org/10.1007/s11162-016-9407-1