Detecting and Measuring Rater Effects in Interpreting Assessment: A Methodological Comparison of Classical Test Theory, Generalizability Theory, and Many-Facet Rasch Measurement

  • Chapter
Testing and Assessment of Interpreting

Part of the book series: New Frontiers in Translation Studies (NFTS)

Abstract

Human raters play an important role in interpreting assessment. Their evaluative judgments of the quality of interpreted renditions produce quantitative measures (e.g., scores, ratings, marks, ranks) that form the basis of relevant decision-making (e.g., program admission, professional certification). Previous research in language testing has found that human raters may be inconsistent over time, excessively harsh or lenient, and biased against particular groups of test candidates, certain types of tasks, or given assessment criteria. Such undesirable phenomena (i.e., rater inconsistency, rater severity/leniency, and rater bias) are collectively known as rater effects, and their presence can lead to unreliable, invalid, and unfair assessments. It is therefore important to investigate possible rater effects in interpreting assessment. Although a number of statistical indices can be computed to measure rater effects, there has been no systematic attempt to compare their applicability and utility. Against this background, the current study compares three psychometric approaches to detecting and measuring rater effects: classical test theory (CTT), generalizability theory (G theory), and many-facet Rasch measurement (MFRM). Our analysis is based on data from a previous assessment of English-to-Chinese simultaneous interpreting in which a total of nine raters were involved. Through this comparison, we hope that interpreting researchers and testers can gain an in-depth understanding of the statistical information generated by these approaches and make informed decisions when selecting an analytic approach commensurate with their local assessment needs.
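For readers less familiar with the three frameworks named above, the equations below give their standard textbook formulations of a rater-mediated score. These are generic sketches of each measurement model, not the chapter's own notation or analysis.

```latex
% Classical test theory: an observed score X is a true score T plus error E;
% reliability is the proportion of observed-score variance due to true scores.
X = T + E, \qquad \rho_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}

% Generalizability theory (crossed person x rater design): observed-score
% variance decomposes into person, rater, and residual components, and the
% generalizability coefficient for relative decisions averages over n_r raters.
\sigma^2(X_{pr}) = \sigma_p^2 + \sigma_r^2 + \sigma_{pr,e}^2, \qquad
E\rho^2 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{pr,e}^2 / n_r}

% Many-facet Rasch measurement (rating-scale formulation): the log-odds that
% rater j awards candidate n category k rather than k-1 is additive in
% candidate ability (theta_n), rater severity (alpha_j), and the category
% threshold (tau_k).
\log \frac{P_{njk}}{P_{nj(k-1)}} = \theta_n - \alpha_j - \tau_k
```

Rater effects surface differently in each framework: as depressed reliability coefficients or deviant rater means under CTT, as a non-negligible rater variance component under G theory, and as individual severity estimates and fit statistics under MFRM.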



Acknowledgements

This work was supported by the National Social Science Foundation (grant number: 18AYY004).

Author information

Correspondence to Chao Han.


Appendix 1: The Analytic Rating Scale

The scale assesses each rendition on three criteria: information completeness (InfoCom), fluency of delivery (FluDel), and target language quality (TLQual).

Band 4 (score range: 7–8)
  • InfoCom: A substantial amount of the original messages delivered (i.e., > 90%), with a few deviations, inaccuracies, and minor/major omissions
  • FluDel: Delivery on the whole fluent, containing a few disfluencies such as (un)filled pauses, long silences, fillers, and/or excessive repairs
  • TLQual: Target language idiomatic and on the whole correct, with only a few instances of unnatural expressions and grammatical errors

Band 3 (score range: 5–6)
  • InfoCom: The majority of the original messages delivered (i.e., 60–70%), with a small number of deviations, inaccuracies, and minor/major omissions
  • FluDel: Delivery generally fluent, containing a small number of disfluencies
  • TLQual: Target language generally idiomatic and mostly correct, with a small number of unnatural expressions and grammatical errors

Band 2 (score range: 3–4)
  • InfoCom: About half of the original messages delivered (i.e., 40–50%), with many instances of deviations, inaccuracies, and minor/major omissions
  • FluDel: Delivery rather fluent; acceptable, but with regular disfluencies
  • TLQual: Target language to a certain degree idiomatic and correct; acceptable, but containing many instances of unnatural expressions and grammatical errors

Band 1 (score range: 1–2)
  • InfoCom: Only a small portion of the original messages delivered (i.e., < 30%), with frequent deviations, inaccuracies, and minor/major omissions, to such a degree that listeners may doubt the integrity of the rendition
  • FluDel: Delivery lacking in fluency and frequently hampered by disfluencies, to such a degree that comprehension may be impeded
  • TLQual: Target language stilted, lacking in idiomaticity, and containing frequent grammatical errors, to such a degree that comprehension may be impeded
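As a concrete illustration of how scores awarded on a scale like this can be screened for rater effects, the sketch below computes two simple CTT-style indices from a candidates × raters score matrix: a severity/leniency proxy per rater and an intraclass correlation for inter-rater consistency. The data are simulated and every number is hypothetical; only the matrix shape (nine raters) mirrors the study's design, and the chapter's G-theory and MFRM analyses go beyond these simple indices.

```python
# Minimal CTT-style screening of rater effects on a candidates x raters
# score matrix. Simulated data only; not the chapter's actual analysis.
import numpy as np

rng = np.random.default_rng(42)
n_candidates, n_raters = 30, 9  # nine raters, as in the study's design

# Simulate total scores on the 3-criterion, 8-point scale (range 3-24):
# candidate ability + rater severity + noise, clipped to the scale range.
ability = rng.normal(0.0, 3.0, size=(n_candidates, 1))
severity_true = rng.normal(0.0, 1.5, size=(1, n_raters))
noise = rng.normal(0.0, 1.5, size=(n_candidates, n_raters))
scores = np.clip(14 + ability - severity_true + noise, 3, 24)

# Severity/leniency proxy: each rater's mean score relative to the grand mean
# (negative = harsher than average, positive = more lenient).
leniency = scores.mean(axis=0) - scores.mean()

# Inter-rater consistency: ICC(2,1) (two-way random effects, absolute
# agreement; Shrout & Fleiss 1979), from the two-way ANOVA mean squares.
n, k = scores.shape
grand = scores.mean()
ss_total = ((scores - grand) ** 2).sum()
ss_cand = k * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_rater = n * ((scores.mean(axis=0) - grand) ** 2).sum()
ms_cand = ss_cand / (n - 1)
ms_rater = ss_rater / (k - 1)
ms_error = (ss_total - ss_cand - ss_rater) / ((n - 1) * (k - 1))
icc_2_1 = (ms_cand - ms_error) / (
    ms_cand + (k - 1) * ms_error + k * (ms_rater - ms_error) / n)

print("Rater leniency (score points vs. grand mean):", leniency.round(2))
print("ICC(2,1):", round(icc_2_1, 3))
```

Large absolute leniency values flag unusually harsh or lenient raters, while a low ICC signals weak consistency; unlike MFRM, these indices cannot separate a rater's severity from the ability of the candidates that rater happened to score.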


Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Han, C. (2021). Detecting and Measuring Rater Effects in Interpreting Assessment: A Methodological Comparison of Classical Test Theory, Generalizability Theory, and Many-Facet Rasch Measurement. In: Chen, J., Han, C. (eds) Testing and Assessment of Interpreting. New Frontiers in Translation Studies. Springer, Singapore. https://doi.org/10.1007/978-981-15-8554-8_5

  • DOI: https://doi.org/10.1007/978-981-15-8554-8_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-8553-1

  • Online ISBN: 978-981-15-8554-8

  • eBook Packages: Education (R0)
