Abstract
Demands for accountability have led to the implementation of large-scale testing programs in Australia and internationally. There is, however, a growing body of evidence that externally imposed testing programs do not have a sustained impact on student achievement. It has been argued that teacher assessment is more effective in raising student achievement levels. However, it is also often argued that teacher assessments are less reliable than the results of testing programs. This paper presents a study in which teachers judged writing scripts using the process of pairwise comparison to generate a scale. The analysis showed high internal consistency of the teacher judgements. The scale locations from the pairwise comparisons were highly correlated with scale estimates for the same students from a large-scale testing program. The results demonstrate that it is possible to efficiently obtain highly reliable and valid teacher judgements using the process of pairwise comparison. Reliability indices are also provided for a series of small-scale assessments that used the same methodology in a range of other domains. These results support the findings of the main study. The article discusses the benefits of using the method to supplement and validate results from large-scale testing programs.
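To illustrate how a set of pairwise judgements can be turned into a scale, the following is a minimal sketch that fits a Bradley-Terry model with a standard iterative update. It is an illustration under stated assumptions, not the software or the exact model used in the study: the function name fit_bradley_terry, the four scripts, and the judgement data are all hypothetical.

```python
import math

def fit_bradley_terry(n_items, comparisons, iterations=200):
    """Estimate scale locations (in logits) from (winner, loser) index pairs.

    Assumes every item wins and loses at least once; otherwise the
    estimates are not finite and math.log will fail on a zero strength.
    """
    wins = [0] * n_items
    pair_counts = {}  # number of times each unordered pair was compared
    for winner, loser in comparisons:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    strength = [1.0] * n_items
    for _ in range(iterations):
        denom = [0.0] * n_items
        for (i, j), n_ij in pair_counts.items():
            share = n_ij / (strength[i] + strength[j])
            denom[i] += share
            denom[j] += share
        strength = [wins[i] / denom[i] for i in range(n_items)]
        # Fix the origin of the scale: mean location of zero logits.
        log_mean = sum(math.log(s) for s in strength) / n_items
        strength = [s / math.exp(log_mean) for s in strength]
    return [math.log(s) for s in strength]

# Hypothetical data: 4 scripts, every pair judged three times, with the
# lower-numbered script preferred in two of the three judgements.
comparisons = []
for better in range(4):
    for worse in range(better + 1, 4):
        comparisons += [(better, worse), (better, worse), (worse, better)]

locations = fit_bradley_terry(4, comparisons)
for script, location in enumerate(locations):
    print(f"script {script}: {location:+.3f} logits")
```

In this made-up design, script 0 is preferred most often and script 3 least often, so the estimated locations fall in that order. Scale locations obtained this way could then be correlated with estimates from another assessment of the same students, which is the kind of comparison the abstract describes.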
Cite this article
Heldsinger, S., Humphry, S. Using the method of pairwise comparison to obtain reliable teacher assessments. Aust. Educ. Res. 37, 1–19 (2010). https://doi.org/10.1007/BF03216919
DOI: https://doi.org/10.1007/BF03216919