On the choice of measures of reliability and validity in the content-analysis of texts


The paper discusses several reliability measures, Scott’s pi, Krippendorff’s alpha, the free-marginal adjustment (Bennett, Alpert and Goldstein’s \(S\)), Cohen’s kappa, and Perreault and Leigh’s \(I\), together with the assumptions on which they are based. It is suggested that these measures be complemented by correlation coefficients between the distribution of qualitative codes, on the one hand, and word co-occurrences and the distribution of the categories identified with the help of a substitution-based dictionary, on the other. The paper shows that the choice of reliability measure depends on the format of the text (stylistic versus rhetorical) and the type of reading (comprehension versus interpretation). In particular, Cohen’s kappa and Bennett, Alpert and Goldstein’s \(S\) emerge as reliability measures particularly suited to the perspectival reading of rhetorical texts. The analysis is informed by the outcomes of a content analysis of 57 texts performed by four coders with the help of the computer program QDA Miner.
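As a hedged sketch of how two of the measures above are computed for two coders (the formulas are standard; the example codes are invented, not the paper’s data): Cohen’s kappa estimates chance agreement from each coder’s own marginal distribution, whereas Bennett, Alpert and Goldstein’s \(S\) fixes chance agreement at \(1/k\) for \(k\) available categories.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa: chance agreement is estimated from each coder's
    own marginal distribution of categories."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

def bennett_s(a, b, k):
    """Bennett, Alpert and Goldstein's S: the free-marginal adjustment,
    with chance agreement fixed at 1/k for k available categories."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    return (k * p_o - 1) / (k - 1)

# Invented example: two coders assign one of three codes to eight units.
coder1 = ["A", "A", "B", "B", "C", "A", "B", "C"]
coder2 = ["A", "B", "B", "B", "C", "A", "A", "C"]
print(round(cohen_kappa(coder1, coder2), 3))    # 0.619
print(round(bennett_s(coder1, coder2, k=3), 3))  # 0.625
```

With identical marginals for both coders, kappa is lower than \(S\) here because the observed marginals make chance agreement (0.344) larger than the free-marginal \(1/3\).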



  1.

    A similar assumption underpins the use of Cronbach’s \(\alpha \) in the cultural consensus theory. The agreement between coders presumably depends on how well they know the content of a cultural domain that exists independently of their input (Weller 2007, p. 343).

  2.

    This assumption also serves to minimize the influence of the coders’ values on the outcomes of content analysis. If the content analysis is not value-free, then the coders have fewer chances to agree on the distribution of the categories. The possibility theorem, which is applicable to choices guided by values, states that “for any method of deriving social choices by aggregating individual preference patterns which satisfies certain natural conditions, it is possible to find individual preference patterns which give rise to a social choice pattern which is not a linear ordering” (Arrow 1950, p. 330).

  3.

    This rationale should not be confused with another, more “positivist” argument advanced by Krippendorff (2004a, p. 249): “we must estimate the distribution of categories in the population of phenomena from the judgments of as many observers as possible (at least two), making the common assumption that observer differences wash out in their average”.

  4.

    As a recent university graduate, the fourth co-author had not yet produced enough publications of her own. She played the role of a “perfect reader” whose take on a text is not affected by the authorship of the other texts included in the sample.

  5.

    When assessing this level of inter-coder agreement, one has to bear in mind that it reflects both the reliability of unitizing and the reliability of coding.

  6.

    The dictionary based on substitution was subject to small edits only at the third stage.

  7.

    The code book for analyzing the first co-author’s texts contained 13 codes; for the second co-author, 15 codes; and for the third, nine codes.

  8.

    The distribution of the reliability measures was visually inspected prior to the correlation analysis. This “eyeballing” suggested that the normality assumption was not seriously violated.
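Note 1 above appeals to Cronbach’s \(\alpha\). As a minimal, hedged sketch of that statistic (the formula is standard; the ratings below are invented for illustration, not drawn from the paper):

```python
def cronbach_alpha(items):
    """Cronbach's alpha for k items (or coders) rated over n cases:
    alpha = k/(k-1) * (1 - sum of item variances / variance of case totals)."""
    k, n = len(items), len(items[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    sum_item_vars = sum(var(item) for item in items)
    totals = [sum(item[j] for item in items) for j in range(n)]
    return k / (k - 1) * (1 - sum_item_vars / var(totals))

# Invented example: three raters score the same four cases.
ratings = [[3, 4, 5, 2], [3, 5, 4, 2], [4, 4, 5, 3]]
print(cronbach_alpha(ratings))  # approx. 0.9
```

High \(\alpha\) here reflects the cultural-consensus reading in note 1: raters who share knowledge of the same domain produce strongly covarying scores.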


  1. Arrow, K.J.: A difficulty in the concept of social welfare. J. Polit. Econ. 58(4), 328–346 (1950)

  2. Artstein, R., Poesio, M.: Inter-coder agreement for computational linguistics. Comput. Linguist. 34(4), 555–596 (2008)

  3. Bennett, E., Alpert, R., Goldstein, A.C.: Communications through limited-response questioning. Public Opin. Quart. 18(3), 303–308 (1954)

  4. Bryman, A., Bell, E., Teevan, J.J.: Social Research Methods, 3rd edn. Oxford University Press, Don Mills (2012)

  5. Camp, S.D., Saylor, W.G., Harer, M.D.: Aggregating individual-level evaluations of the organizational social climate: a multilevel investigation of the work environment at the Federal Bureau of Prisons. Justice Q. 14(4), 739–762 (1997)

  6. Dijkstra, L., van Eijnatten, F.M.: Agreement and consensus in a Q-mode research design: an empirical comparison of measures, and an application. Qual. Quant. 43(5), 757–771 (2009)

  7. Hayes, A.F., Krippendorff, K.: Answering the call for a standard reliability measure for coding data. Commun. Methods Meas. 1(1), 77–89 (2007)

  8. Krippendorff, K.: Content Analysis: An Introduction to Its Methodology. SAGE, Thousand Oaks (2004a)

  9. Krippendorff, K.: Measuring the reliability of qualitative text analysis data. Qual. Quant. 38(6), 787–800 (2004b)

  10. Lotman, Y.: Universe of the Mind: A Semiotic Theory of Culture. Indiana University Press, Bloomington (1990)

  11. Muñoz-Leiva, F., Montoro-Ríos, F.J., Luque-Martínez, T.: Assessment of interjudge reliability in the open-ended questions coding process. Qual. Quant. 40(4), 519–537 (2006)

  12. Neuendorf, K.A.: The Content Analysis Guidebook. SAGE, Thousand Oaks (2002)

  13. Norris, S.P., Phillips, L.M.: The relevance of a reader’s knowledge within a perspectival view of reading. J. Read. Behav. 26(4), 391–412 (1994)

  14. Oleinik, A.: Mixing quantitative and qualitative content analysis: triangulation at work. Qual. Quant. 45(4), 859–873 (2010)

  15. Oleinik, A., Kirdina, S., Popova, I., Shatalova, T.: Kak uchenye chitayut drug druga: osnova teorii akademicheskogo chteniya [How scientists read: on a theory of academic reading]. SOCIS 8 (2013)

  16. Perreault, W.D., Leigh, L.E.: Reliability of nominal data based on qualitative judgments. J. Mark. Res. 26(2), 135–148 (1989)

  17. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill Book Co., New York (1983)

  18. Scott, W.A.: Reliability of content analysis: the case of nominal scale coding. Public Opin. Q. 19(3), 321–325 (1955)

  19. Siegel, S., Castellan, N.J.: Nonparametric Statistics for the Behavioral Sciences. McGraw Hill, New York (1988)

  20. Skinner, Q.: Visions of Politics. Cambridge University Press, Cambridge (2002)

  21. Warner, R.M.: Applied Statistics. SAGE, Thousand Oaks (2008)

  22. Weller, S.C.: Cultural consensus theory: applications and frequently asked questions. Field Methods 19(4), 339–368 (2007)



The authors would like to thank the anonymous reviewers of Quality & Quantity for their helpful and constructive suggestions and comments. However, all remaining errors and inaccuracies are solely attributable to the authors.

Author information



Corresponding author

Correspondence to Anton Oleinik.


Cite this article

Oleinik, A., Popova, I., Kirdina, S. et al. On the choice of measures of reliability and validity in the content-analysis of texts. Qual Quant 48, 2703–2718 (2014). https://doi.org/10.1007/s11135-013-9919-0



  • Reliability measures
  • Content analysis
  • Correlation analysis
  • Interpretation
  • Comprehension
  • Stylistic texts
  • Rhetorical texts