Abstract
Human raters play an important role in interpreting assessment. Their evaluative judgments of the quality of interpreted renditions produce quantitative measures (e.g., scores, ratings, marks, ranks) that form the basis of relevant decision-making (e.g., program admission, professional certification). Previous research in the field of language testing finds that human raters may be inconsistent over time, excessively harsh or lenient, and biased against a particular group of test candidates, a certain type of task, or a given assessment criterion. Such undesirable phenomena (i.e., rater inconsistency, rater severity/leniency, and rater bias) are collectively known as rater effects. Their presence can lead to unreliable, invalid, and unfair assessments, so it is important to investigate possible rater effects in interpreting assessment. Although a number of statistical indices can be computed to measure rater effects, there has been no systematic attempt to compare their applicability and utility. Against this background, the current study compares three psychometric approaches to detecting and measuring rater effects: classical test theory, generalizability theory, and many-facet Rasch measurement. Our analysis is based on data from a previous assessment of English-to-Chinese simultaneous interpreting in which a total of nine raters were involved. Through this comparison, we hope that interpreting researchers and testers can gain an in-depth understanding of the statistical information generated by each approach, and make informed decisions when selecting an analytic approach commensurate with their local assessment needs.
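To make the classical-test-theory indices mentioned above concrete, the following minimal sketch computes two common CTT-style rater-effect indicators on a toy dataset: each rater's mean score relative to the grand mean (a simple severity/leniency index) and the Pearson correlation between two raters' scores (a simple consistency index). All scores, the rater labels R1–R3, and the five hypothetical examinees are invented for illustration; this is not the chapter's actual dataset or its exact analysis.

```python
from statistics import mean

# Hypothetical scores: 3 raters x 5 examinees, on an 8-point scale.
# All numbers are invented for illustration only.
scores = {
    "R1": [6, 7, 5, 8, 6],
    "R2": [4, 5, 3, 6, 4],   # consistently lower -> apparently severe
    "R3": [6, 8, 5, 8, 7],
}

# Severity/leniency index: each rater's mean minus the grand mean.
# Negative values suggest severity; positive values suggest leniency.
grand_mean = mean(s for r in scores.values() for s in r)
severity = {r: round(mean(s) - grand_mean, 2) for r, s in scores.items()}

# Consistency index: Pearson correlation between a pair of raters.
def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    varx = sum((a - mx) ** 2 for a in x)
    vary = sum((b - my) ** 2 for b in y)
    return cov / (varx * vary) ** 0.5

r12 = pearson(scores["R1"], scores["R2"])
print(severity, round(r12, 2))
```

Note that in this toy data R2 correlates perfectly with R1 while scoring about 1.5 points lower on average: correlation-based consistency indices are blind to severity differences, which is one reason comparing complementary approaches (CTT, G-theory, MFRM) matters.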
Acknowledgements
This work was supported by the National Social Science Foundation (grant number 18AYY004).
Appendix 1 The Analytic Rating Scale
| Band/scoring criteria | Information completeness (InfoCom) | Fluency of delivery (FluDel) | Target language quality (TLQual) |
| --- | --- | --- | --- |
| Band 4 (Score range: 7–8) | A substantial amount of the original messages delivered (i.e., > 90%), with a few deviations, inaccuracies, and minor/major omissions | Delivery on the whole fluent, containing a few disfluencies such as (un)filled pauses, long silences, fillers, and/or excessive repairs | Target language idiomatic and on the whole correct, with only a few instances of unnatural expressions and grammatical errors |
| Band 3 (Score range: 5–6) | The majority of the original messages delivered (i.e., 60–70%), with a small number of deviations, inaccuracies, and minor/major omissions | Delivery generally fluent, containing a small number of disfluencies | Target language generally idiomatic and mostly correct, with a small number of instances of unnatural expressions and grammatical errors |
| Band 2 (Score range: 3–4) | About half of the original messages delivered (i.e., 40–50%), with many instances of deviations, inaccuracies, and minor/major omissions | Delivery rather fluent; acceptable, but with regular disfluencies | Target language to a certain degree idiomatic and correct; acceptable, but containing many instances of unnatural expressions and grammatical errors |
| Band 1 (Score range: 1–2) | A small portion of the original messages delivered (i.e., < 30%), with frequent deviations, inaccuracies, and minor/major omissions, to such a degree that listeners may doubt the integrity of the renditions | Delivery lacking fluency and frequently hampered by disfluencies, to such a degree that comprehension may be impeded | Target language stilted, lacking in idiomaticity, and containing frequent grammatical errors, to such a degree that comprehension may be impeded |
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
Cite this chapter
Han, C. (2021). Detecting and Measuring Rater Effects in Interpreting Assessment: A Methodological Comparison of Classical Test Theory, Generalizability Theory, and Many-Facet Rasch Measurement. In: Chen, J., Han, C. (eds) Testing and Assessment of Interpreting. New Frontiers in Translation Studies. Springer, Singapore. https://doi.org/10.1007/978-981-15-8554-8_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-8553-1
Online ISBN: 978-981-15-8554-8
eBook Packages: Education (R0)