
A novel rater agreement methodology for language transcriptions: evidence from a nonhuman speaker

Abstract

The ability to measure agreement between two independent observers is vital to any observational study. We use a unique situation, the calculation of inter-rater reliability for transcriptions of a parrot’s speech, to present a novel method for assessing inter-rater reliability that we believe can be applied to situations in which speech from human subjects may be difficult to transcribe. Challenges encountered included (1) a sparse original agreement matrix, which yielded an omnibus measure of inter-rater reliability, (2) “lopsided” \(2\times 2\) matrices (i.e., subsets) drawn from the overall matrix, and (3) categories used by the transcribers that could not be pre-determined. Our novel approach involved calculating reliability on two levels: that of the corpus and that of the smaller subsets of data mentioned above. Specifically, the technique included “reverse engineering” the categories, using a “null” category when one rater observed a behavior and the other did not, and applying Fisher’s Exact Test to calculate \(r\)-equivalent for the smaller paired subset comparisons. We hope this technique will be useful to those working in similar situations where speech may be difficult to transcribe, such as with small children.
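
As a concrete sketch of the subset-level step, the code below applies Fisher’s Exact Test to a single hypothetical \(2\times 2\) agreement subset (one reverse-engineered category versus the “null” category for each rater) and converts the resulting one-tailed p-value into \(r\)-equivalent. The counts, the helper name r_equivalent, and the use of Python with SciPy are illustrative assumptions, not the authors’ implementation.

    import numpy as np
    from scipy import stats

    def r_equivalent(p_one_tailed, n):
        """Return r-equivalent: the correlation that would yield the
        observed one-tailed p-value with df = n - 2."""
        df = n - 2
        t = stats.t.ppf(1.0 - p_one_tailed, df)  # t whose upper-tail p matches
        return t / np.sqrt(t ** 2 + df)

    # Hypothetical 2 x 2 subset: rows = rater A coded the category vs. the
    # "null" category, columns = the same for rater B (counts invented).
    subset = np.array([[8, 1],
                       [2, 5]])

    # One-tailed Fisher's Exact Test: agreement predicts the direction.
    odds_ratio, p = stats.fisher_exact(subset, alternative="greater")

    n = subset.sum()
    print(f"one-tailed p = {p:.4f}, r-equivalent = {r_equivalent(p, n):.3f}")

Repeating this over each paired subset yields a per-category effect-size estimate of agreement that complements the corpus-level figure.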


Notes

  1. Theoretically, in a case such as this, the 57 % agreement can be dramatically inflated; see Table 1 for an example scenario.
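
A minimal sketch, with invented counts rather than the data behind Table 1 or the 57 % figure, of how a lopsided agreement matrix can make raw percent agreement look far better than a chance-corrected index such as Cohen’s kappa:

    import numpy as np

    # Hypothetical lopsided 2 x 2 agreement matrix: both raters heavily
    # favour one category (counts invented for illustration).
    table = np.array([[50, 4],
                      [3, 3]])

    n = table.sum()
    observed = np.trace(table) / n              # raw percent agreement (~0.88)
    row_m = table.sum(axis=1) / n               # rater A marginal proportions
    col_m = table.sum(axis=0) / n               # rater B marginal proportions
    chance = float(row_m @ col_m)               # agreement expected by chance
    kappa = (observed - chance) / (1 - chance)  # chance-corrected agreement (~0.40)

    print(f"raw = {observed:.2f}, chance = {chance:.2f}, kappa = {kappa:.2f}")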


Author information


Corresponding author

Correspondence to Allison B. Kaufman.

Additional information

Allison B. Kaufman is now in the Department of Ecology and Evolutionary Biology at The University of Connecticut.


About this article

Cite this article

Kaufman, A.B., Colbert-White, E.N. & Rosenthal, R. A novel rater agreement methodology for language transcriptions: evidence from a nonhuman speaker. Qual Quant 48, 2329–2339 (2014). https://doi.org/10.1007/s11135-013-9894-5


Keywords

  • Inter-rater reliability
  • Rater agreement
  • Fisher’s Exact Test
  • \(r\)-Equivalent
  • Sparse agreement matrix
  • Speech transcription