Improving the Reliability of Peer Review Without a Gold Standard

Peer review plays a crucial role in accreditation and credentialing processes: it can identify outliers and foster a peer-learning approach that facilitates error analysis and knowledge sharing. However, traditional peer review methods may fail to address the interpretive variability among reviewing and primary reading radiologists, hindering scalability and effectiveness. Reducing this variability is key to producing reliable results and instilling confidence in the review process. In this paper, we propose a novel statistical approach, the "Bayesian Inter-Reviewer Agreement Rate" (BIRAR), that integrates radiologist variability. By doing so, BIRAR aims to improve the accuracy and consistency of peer review assessments, providing physicians involved in quality improvement and peer learning programs with valuable and reliable insights. A computer simulation was designed to assign predefined interpretive error rates to hypothetical interpreting and peer-reviewing radiologists. A Monte Carlo simulation then sampled (100 samples per experiment) the data that peer reviews would generate. The performance of BIRAR and four other peer review methods for measuring interpretive error rates was then evaluated, including a method that uses a gold-standard diagnosis. Accuracy was defined by the median difference, across Monte Carlo samples, between measured and predefined "actual" interpretive error rates; variability was defined by the 95% CI around that median difference. Applying BIRAR yielded 93% and 79% higher relative accuracy and 43% and 66% lower relative variability than the "Single/Standard" and "Majority Panel" peer review methods, respectively. BIRAR is a practical and scalable peer review method that produces more accurate and less variable assessments of interpretive quality by accounting for variability within the group's radiologists, implicitly applying a standard derived from the level of consensus within the group across various types of interpretive findings.

Supplementary Information: The online version contains supplementary material available at 10.1007/s10278-024-00971-9.


Notations
Let us assume there are $N_\text{grades}$ different severity grade levels. Let us define the set of severity grade levels as $G = \{0, 1, 2, \dots, N_\text{grades} - 1\}$.
The discrepancy value $f(o, r)$ is defined in terms of $o \in G$ (the grade given by the QA'd Radiologist) and $r \in G$ (the grade given by the QA'ing Radiologist) in Equation (2); that is, $f(o, r)$ captures the discrepancy of $o$ with respect to $r$.
Let there be $N_\text{studies}$ studies and $N_\text{reviewers}$ reviewers.

Generative model

Definition
Next, we formulate a hierarchical model for $o$ as follows: $p_{\text{QA'd}} \mid \alpha_{\text{QA'd}} \sim \text{Dirichlet}(\alpha_{\text{QA'd}})$, where $\alpha_{\text{QA'd}}$ is a user-defined parameter.
Similarly, we formulate a hierarchical model for $r$ as follows: $p_{\text{QA'ing}} \mid \alpha_{\text{QA'ing}} \sim \text{Dirichlet}(\alpha_{\text{QA'ing}})$, where $\alpha_{\text{QA'ing}}$ is a user-defined parameter.
We consider $N_\text{grades}$ patient profiles. Therefore, we formulate a hierarchical model for the patient profile, $s \in \{0, 1, \dots, N_\text{grades} - 1\}$, as follows: $s \sim \text{Categorical}(p_s)$, where $p_s$ is a user-defined parameter.
Next, by combining the previous models, we can formulate a hierarchical model for the observed review data $y$ (Equation (6)). Note that $m$ is used to permute the order of QA'ing Radiologists for each study separately. This is needed so that, when $N_\text{reviews}$ is less than the number of QA'ing Radiologists, the studies are not always reviewed by the same subset of QA'ing Radiologists.
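To make the sampling process concrete, the following is a minimal Python sketch of one way such a generative process could be implemented. The Dirichlet/Categorical structure follows the hierarchical models above, but Equation (6) itself is not reproduced in this excerpt, so the variable names and the exact wiring here are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_grades, n_studies, n_reviewers, n_reviews = 3, 200, 5, 2

# User-defined parameters (hypothetical values for illustration).
p_s = np.array([0.5, 0.3, 0.2])              # grade prevalence (see calibration below)
alpha_qad = np.ones((n_grades, n_grades))    # Dirichlet prior, QA'd radiologist
alpha_qaing = np.ones((n_grades, n_grades))  # Dirichlet prior, QA'ing radiologists

# Draw row-stochastic grading matrices: row s holds P(assigned grade | true grade s).
p_qad = np.stack([rng.dirichlet(a) for a in alpha_qad])
p_qaing = np.stack([
    np.stack([rng.dirichlet(a) for a in alpha_qaing]) for _ in range(n_reviewers)
])

records = []
for i in range(n_studies):
    s = rng.choice(n_grades, p=p_s)           # latent true pathology grade
    o = rng.choice(n_grades, p=p_qad[s])      # grade from the QA'd radiologist
    # m permutes reviewer order per study, so when n_reviews < n_reviewers
    # each study is not always reviewed by the same subset of reviewers.
    m = rng.permutation(n_reviewers)[:n_reviews]
    r = [rng.choice(n_grades, p=p_qaing[k][s]) for k in m]
    records.append((s, o, list(m), r))
```

With the calibrated parameters introduced in the sections below, the uniform Dirichlet concentrations here would be replaced by the calibration matrices, and $p_s$ by the calibrated prevalence vector.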

Graphical model
The model defined in Equation (6) is visualized in Figure 1.

Data simulation
The model defined in Equation (6) is used to simulate QA review data. The rates of diagnostic errors defined for the QA'd Radiologist, and the relative rates at which QA'd Exams have true pathology grades 0, 1, and 2, were roughly informed by empirical data observed by the authors in active QA programs, but are effectively arbitrary choices that enable the relative performance of the five diagnostic-error-rate measurement methodologies to be compared. In the following sections, we describe in detail how we calibrated the generative model for simulating data.

Calibration of QA'd exams
The probabilities that QA'd Exams have true pathology grades 0, 1, and 2, respectively, were calibrated using the three probabilities defined in the vector $p_s = (0.5, 0.3, 0.2)$. The vector $p_s$ can be interpreted as: "50% of the QA'd Exams have pathology of grade 0, 30% have pathology of grade 1, and 20% have pathology of grade 2."
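As a quick illustration, drawing true grades from this calibrated vector is a single categorical draw per exam; a sketch using numpy (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p_s = np.array([0.5, 0.3, 0.2])   # P(grade 0), P(grade 1), P(grade 2)
true_grades = rng.choice(3, size=10_000, p=p_s)
print(np.bincount(true_grades) / true_grades.size)   # approx. [0.5, 0.3, 0.2]
```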

Calibration of interpreting radiologists
The predefined probabilities that patient exams with true pathologies of specific grades would be interpreted as the correct grade, or incorrectly as one of the other grades, are modeled as Dirichlet distributions. These are calibrated using the following pattern of correct and incorrect diagnoses, defined in the matrix

$$A_{\text{QA'd}} = \begin{pmatrix} 80 & 4 & 2 \\ 3 & 15 & 3 \\ 2 & 4 & 20 \end{pmatrix},$$

where rows correspond to true grades 0, 1, and 2 and columns to assigned grades 0, 1, and 2. The first row can be interpreted as: "out of 86 exams in which the true grading of the patient's pathology is '0', the QA'd Radiologist will correctly grade the exam as '0' 80 times, incorrectly grade it as '1' 4 times, and incorrectly grade it as '2' 2 times." The second row: "out of 21 exams in which the true grading is '1', the QA'd Radiologist will correctly grade the exam as '1' 15 times, incorrectly grade it as '0' 3 times, and incorrectly grade it as '2' 3 times." The third row: "out of 26 exams in which the true grading is '2', the QA'd Radiologist will correctly grade the exam as '2' 20 times, incorrectly grade it as '0' 2 times, and incorrectly grade it as '1' 4 times."

The overall interpretive error rate assigned to each of the QA'd radiologists is predefined to be 17% (i.e., in 83% of exams, the QA'd radiologist will grade the pathology correctly). Further, the QA'd radiologists' rate of "two-degree errors", defined as errors where a grade 0 pathology is diagnosed as grade 2 or vice versa, is 3% (i.e., in 97% of exams, the QA'd radiologist will grade the pathology correctly or be at most one degree off).
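The predefined rates quoted above can be recovered directly from $A_{\text{QA'd}}$ and $p_s$: row-normalize the counts into conditional probabilities, then average under the grade prevalence. A short verification in Python (matrix values transcribed from the row interpretations above):

```python
import numpy as np

p_s = np.array([0.5, 0.3, 0.2])
# Rows: true grade 0, 1, 2; columns: assigned grade 0, 1, 2 (counts).
A_qad = np.array([[80, 4, 2],
                  [3, 15, 3],
                  [2, 4, 20]])
P = A_qad / A_qad.sum(axis=1, keepdims=True)       # P(o = j | s = i)

accuracy = p_s @ np.diag(P)                        # P(o = s)
two_degree = p_s[0] * P[0, 2] + p_s[2] * P[2, 0]   # grade 0 read as 2, or vice versa
print(f"error rate: {1 - accuracy:.0%}, two-degree errors: {two_degree:.0%}")
# -> error rate: 17%, two-degree errors: 3%
```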

Calibration of reviewing radiologists
The panel of reviewing radiologists participating in the simulated QA program was assigned predefined probabilities of making interpretive errors in the same manner as the QA'd radiologists; however, they were defined to have three different profiles with respect to the probabilities that they would correctly detect and grade the pathology of interest in the secondary QA reviews. These probabilities were modeled in the same manner as for the QA'd Radiologists, described above, and were calibrated using the patterns of correct and incorrect diagnoses defined in the three matrices $A_{\text{QA'ing},1}$, $A_{\text{QA'ing},2}$, and $A_{\text{QA'ing},3}$. Inspection of these three calibration matrices reveals that Reviewing Radiologist Profiles 1 (defined by $A_{\text{QA'ing},1}$), 2 (defined by $A_{\text{QA'ing},2}$), and 3 (defined by $A_{\text{QA'ing},3}$) have lower, equal, and higher probabilities of errors, respectively, compared to what was defined for the QA'd Radiologists above. Given $p_s$ as defined above, reviewing radiologists assigned Profile 1 had an overall interpretive error rate of 13%, and those assigned Profile 3 had an overall interpretive error rate of 22%.
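One natural reading of "calibrated using the patterns of correct and incorrect diagnoses" is that each row of counts serves as the concentration parameter of a Dirichlet prior over that radiologist's conditional grading probabilities. The sketch below illustrates this mechanism with the QA'd matrix, since the reviewer matrices $A_{\text{QA'ing},j}$ are not reproduced in this excerpt; treat it as an assumption rather than the paper's exact calibration procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
A_qad = np.array([[80, 4, 2],
                  [3, 15, 3],
                  [2, 4, 20]], dtype=float)

# Each count row becomes a Dirichlet concentration parameter, so sampled
# probability rows concentrate around the observed diagnostic pattern.
P_sampled = np.stack([rng.dirichlet(alpha) for alpha in A_qad])
print(P_sampled.round(3))   # one sampled error-probability matrix
```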

Generative model
We use a simplified version of the model defined in Equation (6), given in Equation (12). Note that in the model defined in Equation (12) we do not have the hierarchical priors for $p_{\text{QA'd},k}$ and $p_{\text{QA'ing},k}$ as in Equation (6); additionally, $p_{\text{QA'ing},k}$ is shared across QA'ing Radiologists. That is, we estimate a single REPDM; however, we could estimate a REPDM for each QA'ing Radiologist separately.
The posterior probability density function of the model defined in Equation (12) is proportional to the joint probability density function (Equation (13)). Note that the discrete variables $s_i$, $i = 1, 2, \dots, N_\text{studies}$, are marginalized out.
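Marginalizing the discrete $s_i$ out of the joint density is the usual mixture-model device: for each study, sum the joint over the $N_\text{grades}$ candidate grades, typically in log space for numerical stability. A hedged sketch (all names are hypothetical; the exact factorization of Equation (13) is not reproduced in this excerpt):

```python
import numpy as np
from scipy.special import logsumexp

def log_lik_study(o, reviews, log_p_s, log_P_qad, log_P_qaing):
    """Log-likelihood of one study with the latent grade s summed out.

    o           : grade given by the QA'd radiologist
    reviews     : grades given by the QA'ing radiologists for this study
    log_p_s     : log grade prevalence, shape (n_grades,)
    log_P_qad,
    log_P_qaing : log conditional grading matrices, shape (n_grades, n_grades)
    """
    per_s = log_p_s + log_P_qad[:, o]        # log p(s) + log p(o | s)
    for r in reviews:
        per_s = per_s + log_P_qaing[:, r]    # + log p(r_j | s)
    return logsumexp(per_s)                  # log of the sum over s of the joint
```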

QA'ing Radiologists' error detection probability
We can state the probability of $s$ given $o$ and $r$ under the model defined in Equation (12) using Bayes' theorem, as in Equation (14):

$$p(s \mid o, r) = \frac{p(s)\, p(o \mid s)\, p(r \mid s)}{\sum_{s' \in G} p(s')\, p(o \mid s')\, p(r \mid s')}.$$

These probabilities quantify our belief in the grade levels, $r$, given by QA'ing Radiologists.
For instance, Equation (15) defines an example scenario, and the corresponding posterior probabilities then follow from Equation (14). Using a similar approach as in Equation (14), we can state the probabilities $p((s, f(o, s)) \mid (r, f(o, r)))$. If the discrepancy function $f(x, \cdot)$ for some $x$ is a many-to-one function, i.e., $\exists x \in G\ \exists y, z \in G\ (f(x, y) = f(x, z) \wedge y \neq z)$, then we may want to aggregate over those equivalence classes, as our focus is on the pairs made of a severity grade and a discrepancy value. These probabilities quantify our belief in the grade levels and discrepancy values, $(r, f(o, r))$, given by QA'ing Radiologists.
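Because $o$ and $r$ are conditionally independent given $s$ in the graphical model, the posterior in Equation (14) is a normalized three-term product. A sketch, reusing the QA'd calibration matrix for the reviewer as well (a simplification consistent with the shared reviewer matrix in Equation (12); the example grades are hypothetical):

```python
import numpy as np

def posterior_s(o, r, p_s, P_qad, P_qaing):
    """p(s | o, r) proportional to p(s) * p(o | s) * p(r | s)."""
    unnorm = p_s * P_qad[:, o] * P_qaing[:, r]
    return unnorm / unnorm.sum()

p_s = np.array([0.5, 0.3, 0.2])
A = np.array([[80, 4, 2], [3, 15, 3], [2, 4, 20]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)

# Example: the QA'd radiologist says grade 2, a reviewer says grade 0;
# the posterior spreads belief across grades, reflecting the disagreement.
print(posterior_s(o=2, r=0, p_s=p_s, P_qad=P, P_qaing=P).round(3))
```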
Let us consider the same scenario as in Equation (15). Next, we simulate data and study the Kullback-Leibler divergence of the true distribution from the estimated distribution as a function of the number of studies and the number of reviews. The results are shown in Figure 2.
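For categorical distributions, the KL divergence of the true distribution $p$ from the estimated distribution $q$ is $\sum_i p_i \log(p_i / q_i)$; scipy computes exactly this when `entropy` is given two distributions (the example values are hypothetical):

```python
import numpy as np
from scipy.stats import entropy

p_true = np.array([0.5, 0.3, 0.2])     # true distribution
p_est = np.array([0.45, 0.35, 0.20])   # estimated distribution
print(entropy(p_true, p_est))          # D_KL(p_true || p_est), in nats
```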

Estimation of error rates
Without a correction for imperfect QA'ing Radiologists

If $N_\text{reviews} = 1$ for all $i = 1, 2, \dots, N_\text{studies}$, then the maximum likelihood estimates (MLEs) of the rates of discrepancies under the categorical model without a correction for imperfect QA'ing Radiologists, given a severity grade $g \in G$, are given by Equation (17).

With a correction for imperfect QA'ing Radiologists

If $N_\text{reviews} = 1$ for all $i = 1, 2, \dots, N_\text{studies}$, then the MLEs of the rates of discrepancies under the categorical model with a correction for imperfect QA'ing Radiologists, given a severity grade $g \in G$, are given by the corrected counterpart of Equation (17).
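The estimator equations themselves are elided in this excerpt. Under a categorical model, the uncorrected MLE is simply the empirical frequency of each discrepancy value among studies whose review carries severity grade $g$, and a natural correction replaces these hard counts with soft counts weighted by the posterior of Equation (14). The sketch below reflects that reading and should be taken as an assumption, not the paper's exact estimators:

```python
import numpy as np
from collections import Counter

def mle_uncorrected(pairs, g, f):
    """Empirical discrepancy rates among studies whose reviewer grade r equals g."""
    vals = [f(o, r) for o, r in pairs if r == g]
    return {d: c / len(vals) for d, c in Counter(vals).items()}

def mle_corrected(pairs, g, f, p_s, P_qad, P_qaing):
    """Soft-count discrepancy rates, weighting each study by p(s = g | o, r)."""
    weights, total = Counter(), 0.0
    for o, r in pairs:
        post = p_s * P_qad[:, o] * P_qaing[:, r]
        post = post / post.sum()        # p(s | o, r), as in Equation (14)
        weights[f(o, g)] += post[g]     # discrepancy of o w.r.t. the candidate true grade
        total += post[g]
    return {d: w / total for d, w in weights.items()}
```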

Figure 1: The generative model defined in Equation (6), visualized using plate notation. The grey and white circles denote observed and latent variables, respectively. The solid black circles denote user-definable parameters. The plates denote repetitions. The directed edges indicate dependencies between variables.

Figure 2: Kullback-Leibler divergence (KLD) as a function of the number of studies and the number of reviews. Each boxplot is estimated based on 100 simulations.