IRR enables researchers to quantify the degree of agreement in clinical ratings among two or more raters (e.g., Ventura et al. 1998). IRR aids in resolving issues of differential diagnosis and of overdiagnosis or underdiagnosis of BD (e.g., Hirschfeld et al. 2003; Zimmerman et al. 2008). Because there are no published guidelines on IRR practices, we describe four common features.
First, IRR raters are trained in diagnostic criteria and clinical ratings, including listening to and coding interviews from previous research participants, live observation, and supervised co-interviews. Additional training may include meeting an agreement criterion for clinical competency before conducting interviews (e.g., Weinstock et al. 2016).
Second, an investigator may choose to hold regular consensus meetings over the course of data collection. The goal of consensus meetings is to confirm that a diagnosis (or score) is accurate, or to record a corrected diagnosis (or score) established through discussion. Consensus meetings in clinical research are not designed as a reliability tool; however, they may serve to maintain rater consistency and prevent rater drift over time (e.g., Miklowitz et al. 2003). Raters may correct their scores when they conclude they have made an error or inaccuracy; however, if a disagreement remains and reflects an honest difference of opinion, it is retained as such, because consensus meetings are not intended to minimize discrepancies based on honest differences of opinion (e.g., Sachs et al. 2003; Weinstock et al. 2016). Consensus meetings can occur weekly, monthly, at important time anchors, or not at all when deemed unnecessary. Attendance includes some combination of supervisor(s), independent rater(s), the original interviewer, and staff (e.g., Kosten and Rounsaville 1992). If a relevant member is unable to attend, notes are taken for later consideration (e.g., Ong et al. 2017). All of these common variations fall within accepted standards of practice.
Third, each rater is assigned a subset of recorded interviews, sampled randomly, quasi-randomly, or nonrandomly, to rate blindly and independently (i.e., prior to any group discussions). The proportion of interviews rated blindly may vary from less than 10% to 100%, although a larger subset is preferable. Some researchers may skip this step because no subfield norms require it or because of practical constraints such as staff shortages.
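As a minimal sketch of the random-sampling variant, the following Python example draws a fixed proportion of recorded interviews for blind re-rating. The function name, participant identifiers, 20% proportion, and seed are hypothetical and chosen only for illustration, not as a recommended standard.

```python
import random

def sample_interviews_for_blind_rating(interview_ids, proportion=0.20, seed=2024):
    """Randomly draw a proportion of recorded interviews for blind,
    independent re-rating; the proportion and seed here are illustrative."""
    rng = random.Random(seed)  # fixed seed so the drawn subset is reproducible and documentable
    n_to_rate = max(1, round(len(interview_ids) * proportion))
    return sorted(rng.sample(list(interview_ids), n_to_rate))

# Hypothetical example: 50 recorded interviews, 20% selected for blind re-rating
interview_ids = [f"P{i:03d}" for i in range(1, 51)]
print(sample_interviews_for_blind_rating(interview_ids))
```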
Fourth, current norms for reporting IRR are brief. Most studies include a description of the interviewer(s) and independent rater(s), the proportion of interviews reviewed, and IRR statistics such as Kappa (for categorical diagnoses) or Intraclass Correlation Coefficients (for continuous measures). Often there is little to no mention of whether consensus meetings occurred and, when they are noted, minimal detail is provided. It is often not specified whether the reported statistics reflect pre-consensus agreement (i.e., how much raters agreed before the meeting; Weinstock et al. 2016) or post-consensus agreement (i.e., how much raters agreed after the meeting; e.g., Ong et al. 2017). Reported IRR values are commonly high (e.g., Skre et al. 1991), partly because the SCID “skip out” structure reduces opportunities for disagreement (e.g., Joormann and Gotlib 2007). Although it is beyond the scope of this letter to offer a definitive conclusion about what constitutes acceptable IRR, relevant commentaries elsewhere suggest variability in acceptable value ranges. For example, while some researchers consider kappas above 0.70 to indicate good agreement, others propose a lower goal of k = 0.40–0.60 and state that values as low as 0.20–0.40 are acceptable for psychiatric diagnoses (cf. Spitzer et al. 2012).
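To make the two reported statistics concrete, the sketch below computes Cohen's kappa for two raters' categorical diagnoses and one common intraclass correlation form, ICC(2,1) (two-way random effects, absolute agreement, single rater), for continuous scores. The diagnoses and symptom scores are invented solely for illustration (they are not data from any study), and in practice dedicated statistical software would typically be used.

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters' categorical diagnoses:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the marginal rates."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(r1, r2)
    p_o = np.mean(r1 == r2)
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_o - p_e) / (1 - p_e)

def icc_2_1(ratings):
    """ICC(2,1): two-way random-effects, absolute-agreement, single-rater ICC
    for an n_subjects x n_raters matrix of continuous scores."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ms_rows = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)   # between-subject mean square
    ms_cols = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)   # between-rater mean square
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))               # residual mean square
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Invented ratings, for illustration only
dx_rater1 = ["BD-I", "BD-II", "MDD", "BD-I", "MDD", "BD-II"]
dx_rater2 = ["BD-I", "BD-II", "BD-II", "BD-I", "MDD", "BD-II"]
symptom_scores = [[22, 24], [35, 33], [18, 20], [41, 44], [27, 26]]  # rows = participants, columns = raters

print(f"Cohen's kappa: {cohens_kappa(dx_rater1, dx_rater2):.2f}")   # 0.75 for these invented diagnoses
print(f"ICC(2,1): {icc_2_1(symptom_scores):.2f}")
```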