The problem of assessing problem solving: can comparative judgement help?


Abstract

School mathematics examination papers are typically dominated by short, structured items that fail to assess sustained reasoning or problem solving. A contributory factor to this situation is the need for student work to be marked reliably by a large number of markers of varied experience and competence. We report a study that tested an alternative approach to assessment, called comparative judgement, which may represent a superior method for assessing open-ended questions that encourage a range of unpredictable responses. An innovative problem solving examination paper was specially designed by examiners, evaluated by mathematics teachers, and administered to 750 secondary school students of varied mathematical achievement. The students’ work was then assessed by mathematics education experts using comparative judgement as well as a specially designed, resource-intensive marking procedure. We report two main findings from the research. First, the examination paper writers, when freed from the traditional constraint of producing a mark scheme, designed questions that were less structured and more problem-based than is typical in current school mathematics examination papers. Second, the comparative judgement approach to assessing the student work proved successful by our measures of inter-rater reliability and validity. These findings open new avenues for how school mathematics, and indeed other areas of the curriculum, might be assessed in the future.
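
The abstract leaves the statistical detail to the paper itself, but the core mechanics of comparative judgement are straightforward to sketch: judges repeatedly decide which of two pieces of student work is better, a paired-comparison model (Bradley-Terry and Thurstone models are the standard choices) converts those binary decisions into a score scale, and the consistency of the scale across judges can be checked with a split-half correlation. The Python sketch below illustrates this under the Bradley-Terry assumption; the function names, fitting routine, and synthetic data are illustrative only, and should not be read as the analysis reported in the article.

```python
import numpy as np

def fit_bradley_terry(pairs, n_scripts, n_iters=1000, lr=1.0):
    """Estimate a quality parameter theta for each script from (winner, loser)
    judgement pairs, by diagonally scaled gradient ascent on the Bradley-Terry
    log-likelihood, where P(i beats j) = 1 / (1 + exp(theta_j - theta_i))."""
    winners = np.array([w for w, _ in pairs])
    losers = np.array([l for _, l in pairs])
    counts = np.zeros(n_scripts)            # how often each script was judged
    np.add.at(counts, winners, 1.0)
    np.add.at(counts, losers, 1.0)
    counts = np.maximum(counts, 1.0)        # guard against unseen scripts
    theta = np.zeros(n_scripts)
    for _ in range(n_iters):
        # Predicted probability that each recorded winner beats its loser
        p_win = 1.0 / (1.0 + np.exp(theta[losers] - theta[winners]))
        grad = np.zeros(n_scripts)
        np.add.at(grad, winners, 1.0 - p_win)     # winning raises theta
        np.add.at(grad, losers, -(1.0 - p_win))   # losing lowers theta
        theta += lr * grad / counts
        theta -= theta.mean()               # anchor: only differences are identified
    return theta

def split_half_reliability(judgements, n_scripts, seed=0):
    """Split the judges into two random halves, fit a separate scale from each
    half's judgements, and correlate the two sets of estimates -- one common
    way of quantifying inter-rater reliability in comparative judgement studies."""
    rng = np.random.default_rng(seed)
    judges = sorted({j for j, _, _ in judgements})
    rng.shuffle(judges)
    half = set(judges[: len(judges) // 2])
    theta_a = fit_bradley_terry([(w, l) for j, w, l in judgements if j in half], n_scripts)
    theta_b = fit_bradley_terry([(w, l) for j, w, l in judgements if j not in half], n_scripts)
    return np.corrcoef(theta_a, theta_b)[0, 1]

if __name__ == "__main__":
    # Tiny synthetic demo: 6 scripts whose true quality equals their index,
    # judged pairwise by 10 simulated judges making 30 comparisons each.
    rng = np.random.default_rng(1)
    true_quality = np.arange(6, dtype=float)
    judgements = []
    for judge in range(10):
        for _ in range(30):
            i, j = rng.choice(6, size=2, replace=False)
            p_i_wins = 1.0 / (1.0 + np.exp(true_quality[j] - true_quality[i]))
            w, l = (i, j) if rng.random() < p_i_wins else (j, i)
            judgements.append((judge, w, l))
    theta = fit_bradley_terry([(w, l) for _, w, l in judgements], 6)
    print("estimated scale:", np.round(theta, 2))
    print("split-half reliability:", round(split_half_reliability(judgements, 6), 3))
```

One design point worth noting: paired comparisons identify only the differences between scripts, so the estimates are returned mean-centred on a logit scale; any reported grade boundaries would have to be anchored separately.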



Acknowledgments

This work was supported by a Royal Society Shuttleworth Research Fellowship to IJ, a Royal Society Worshipful Company of Actuaries Research Fellowship to MI, and the Nuffield Foundation.

Author information

Corresponding author

Correspondence to Ian Jones.

Electronic supplementary material

Below are the links to the electronic supplementary material.

ESM 1

(PDF 6.01 MB)

ESM 2

(PDF 216 KB)

About this article

Cite this article

Jones, I., Inglis, M. The problem of assessing problem solving: can comparative judgement help? Educ Stud Math 89, 337–355 (2015). https://doi.org/10.1007/s10649-015-9607-1
