Abstract
Human raters play an important role in interpreting assessment. Their evaluative judgments of the quality of interpreted renditions produce quantitative measures (e.g., scores, ratings, marks, ranks) that form the basis of relevant decision-making (e.g., program admission, professional certification). Previous research in the field of language testing finds that human raters may be inconsistent over time, excessively harsh or lenient, and biased against a particular group of test candidates, a certain type of task, or a given assessment criterion. Such undesirable phenomena (i.e., rater inconsistency, rater severity/leniency, and rater bias) are collectively known as rater effects. Their presence can lead to unreliable, invalid, and unfair assessments, so it is important to investigate possible rater effects in interpreting assessment. Although a number of statistical indices can be computed to measure rater effects, there has been no systematic attempt to compare their applicability and utility. Against this background, the current study compares three psychometric approaches to detecting and measuring rater effects: classical test theory, generalizability theory, and many-facet Rasch measurement. Our analysis is based on data from a previous assessment of English-to-Chinese simultaneous interpreting in which a total of nine raters were involved. Through this comparison, we hope that interpreting researchers and testers can gain an in-depth understanding of the statistical information generated by each approach, and make informed decisions when selecting an analytic approach commensurate with their local assessment needs.
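To make the classical-test-theory indices mentioned above concrete, the following minimal sketch computes two common CTT-style rater-effect indicators on a toy dataset: each rater's mean score relative to the grand mean (a simple severity/leniency index) and the Pearson correlation between two raters' scores (a simple consistency index). All scores, the rater labels R1–R3, and the five hypothetical examinees are invented for illustration; this is not the chapter's actual dataset or its exact analysis.

```python
from statistics import mean

# Hypothetical scores: 3 raters x 5 examinees, on an 8-point scale.
# All numbers are invented for illustration only.
scores = {
    "R1": [6, 7, 5, 8, 6],
    "R2": [4, 5, 3, 6, 4],   # consistently lower -> apparently severe
    "R3": [6, 8, 5, 8, 7],
}

# Severity/leniency index: each rater's mean minus the grand mean.
# Negative values suggest severity; positive values suggest leniency.
grand_mean = mean(s for r in scores.values() for s in r)
severity = {r: round(mean(s) - grand_mean, 2) for r, s in scores.items()}

# Consistency index: Pearson correlation between a pair of raters.
def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    varx = sum((a - mx) ** 2 for a in x)
    vary = sum((b - my) ** 2 for b in y)
    return cov / (varx * vary) ** 0.5

r12 = pearson(scores["R1"], scores["R2"])
print(severity, round(r12, 2))
```

Note that in this toy data R2 correlates perfectly with R1 while scoring about 1.5 points lower on average: correlation-based consistency indices are blind to severity differences, which is one reason comparing complementary approaches (CTT, G-theory, MFRM) matters.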
Acknowledgements
This work was supported by the National Social Science Foundation (grant number 18AYY004).
Appendix 1 The Analytic Rating Scale
| Band/scoring criteria | Information completeness (InfoCom) | Fluency of delivery (FluDel) | Target language quality (TLQual) |
| --- | --- | --- | --- |
| Band 4 (Score range: 7–8) | A substantial amount of the original messages delivered (i.e., > 90%), with a few deviations, inaccuracies, and minor/major omissions | Delivery on the whole fluent, containing a few disfluencies such as (un)filled pauses, long silences, fillers, and/or excessive repairs | Target language idiomatic and on the whole correct, with only a few instances of unnatural expressions and grammatical errors |
| Band 3 (Score range: 5–6) | The majority of the original messages delivered (i.e., 60–70%), with a small number of deviations, inaccuracies, and minor/major omissions | Delivery generally fluent, containing a small number of disfluencies | Target language generally idiomatic and mostly correct, with a small number of instances of unnatural expressions and grammatical errors |
| Band 2 (Score range: 3–4) | About half of the original messages delivered (i.e., 40–50%), with many instances of deviations, inaccuracies, and minor/major omissions | Delivery rather fluent; acceptable, but with regular disfluencies | Target language to a certain degree idiomatic and correct; acceptable, but containing many instances of unnatural expressions and grammatical errors |
| Band 1 (Score range: 1–2) | A small portion of the original messages delivered (i.e., < 30%), with frequent deviations, inaccuracies, and minor/major omissions, to such a degree that listeners may doubt the integrity of the renditions | Delivery lacking fluency and frequently hampered by disfluencies, to such a degree that comprehension may be impeded | Target language stilted, lacking in idiomaticity, and containing frequent grammatical errors, to such a degree that comprehension may be impeded |
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
Cite this chapter
Han, C. (2021). Detecting and Measuring Rater Effects in Interpreting Assessment: A Methodological Comparison of Classical Test Theory, Generalizability Theory, and Many-Facet Rasch Measurement. In: Chen, J., Han, C. (eds) Testing and Assessment of Interpreting. New Frontiers in Translation Studies. Springer, Singapore. https://doi.org/10.1007/978-981-15-8554-8_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-8553-1
Online ISBN: 978-981-15-8554-8
eBook Packages: Education (R0)