INTRODUCTION

Scientific meeting abstract review is susceptible to poor inter-rater agreement, which can lead to decreased differentiation among abstracts. A rubric is “a scoring guide…with three essential features: evaluative criteria, quality definitions, and a scoring strategy.”1 Abstract review guided by a detailed rubric could improve inter-rater reliability and lead to the presentation of higher-quality abstracts.

The 1991 Society of General Internal Medicine (SGIM) scientific abstract committee analyzed inter-rater agreement.2 At that time, there were three criteria: interest to SGIM audience, quality of methods, and quality of presentation. Score options were as follows: 1= poor, 2 = fair, 3 = good, 4 = very good, and 5 = outstanding. Given significant reviewer disagreement, the authors suggested a 7-point scoring scale with explicit descriptions of the scores.

By 2016, there were four criteria, with sparse instructions (“1, lowest; 7, highest”). In 2017, a large-scale rubric modification was initiated, retaining four review criteria (Importance, Methods, Conclusions, and Writing), but adding detailed descriptions for each score on the 7-point scale within each criterion (see Text Box 1). We examined whether the 2017 rubric addressed scoring issues including leniency bias (abstract mean scores), inter-rater reliability (within-abstract standard deviations), and discriminability of abstracts (across-abstract standard deviations).

Text Box 1. 2017 abstract review rubric

Importance of the Research Question [Importance]: To what extent does the abstract address a topic that is important? To what degree will the results advance concepts in General Internal Medicine?

1 = Does not address a topic important to general internists.
2 = Addresses a topic important to only a few general internists.
3 = Addresses a topic important to some general internists.
4 = Addresses a topic important to about half of general internists.
5 = Addresses a topic that is important to many general internists; or somewhat expands current concepts.
6 = Addresses a topic that is important to most general internists; or greatly expands current concepts.
7 = Addresses a topic that is important to nearly all general internists; or introduces a new concept.

Strength and Appropriateness of Methods [Methods]: Is the study design clearly described? Are sampling procedures adequately described, including inclusion and exclusion criteria; is there potential selection bias? Are the measures reliable and valid? Are possible confounding factors addressed? Are the statistical analyses appropriate for the study design, and are they the best that could have been used? Is there discussion of the statistical power?

[Please note that not all issues described apply to all abstract types. For example, qualitative studies may not have statistical analyses; however, they should still be evaluated on the quality of study design description and appropriateness of the methods.]

1 = Study design and sampling procedures not described. Possible confounders not discussed. Statistical analyses are not discussed.
2 = Study design and sampling procedures poorly described. Possible confounders not discussed.
3 = Study design and sampling procedures adequately described. Possible confounders not discussed. Statistical analyses are adequate.
4 = Study design and sampling procedures fully described. Measures are probably reliable and valid. Possible confounders partially discussed, but may not be controlled. Statistical analyses are appropriate.
5 = Study design and sampling procedures fully described. No selection bias exists. Measures probably reliable and valid. Possible confounders fully discussed and controlled for as needed. Statistical analyses are appropriate.
6 = Study design and sampling procedures well described. No selection bias exists. Measures are reliable and valid. Possible confounders fully discussed and controlled for as needed. Statistical analyses are strong.
7 = Study design and sampling procedures very clearly described. No selection bias exists. Measures are reliable and valid. Possible confounders fully discussed and controlled for as needed. Statistical analyses are the best that could have been used.

Validity of Conclusions and Implications [Conclusions]: Are conclusions clearly stated and justified by the data? Are implications strong enough to influence how clinicians/teachers/researchers “act” in clinical practice, teaching, or future research?

1 = Conclusions and implications not included. Does not influence action.
2 = Conclusions present but not justified. Does not influence action.
3 = Conclusions present and weakly supported. Provides knowledge but likely will not change action.
4 = Conclusions clearly stated and supported. Absent or weak implications. Provides knowledge but likely will not change action.
5 = Conclusions clearly stated and supported. Implications weak. Provides knowledge that may change action.
6 = Conclusions clearly stated and supported. Implications moderately appropriate. Provides knowledge that may change action.
7 = Conclusions clearly stated and supported. Implications fully appropriate. Provides knowledge that likely will change action.

Quality of Writing [Writing]: Is the writing clear and organized to effectively communicate the findings?

1 = Writing is poor and disorganized.
2 = Writing is adequate and somewhat disorganized.
3 = Writing is adequate and minimally disorganized.
4 = Writing is clear and organized.
5 = Writing is above average and organized.
6 = Writing is high quality and well organized.
7 = Writing is masterful and well organized.

METHODS

We analyzed all abstracts submitted from 2014 to 2018, with 2014–2016 designated as “old” and 2017–2018 as “new” rubric periods. We calculated the composite score for each abstract-reviewer combination as the mean of the four individual criteria scores (Importance, Methods, Conclusions, and Writing) provided by a reviewer for a given abstract. We calculated the final score for each abstract as the unweighted mean of the composite scores from all submitted reviews for that abstract.
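For concreteness, a minimal sketch of this aggregation in Python (using pandas), assuming a hypothetical long-format table of reviews with one row per abstract-reviewer pair; the data and column names are illustrative only, not drawn from the actual review system:

```python
import pandas as pd

# Hypothetical long-format review data: one row per abstract-reviewer pair.
# Values and column names are illustrative placeholders.
reviews = pd.DataFrame({
    "abstract_id": [1, 1, 2, 2],
    "reviewer_id": ["A", "B", "A", "C"],
    "importance":  [5, 6, 3, 4],
    "methods":     [4, 5, 3, 3],
    "conclusions": [5, 5, 4, 4],
    "writing":     [6, 6, 4, 5],
})

criteria = ["importance", "methods", "conclusions", "writing"]

# Composite score for each abstract-reviewer combination:
# the mean of the four criterion scores from that review.
reviews["composite"] = reviews[criteria].mean(axis=1)

# Final score for each abstract: the unweighted mean of the composite
# scores across all submitted reviews of that abstract.
final_scores = reviews.groupby("abstract_id")["composite"].mean()
```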

All analyses compared “old” to “new” rubric abstracts. First, we calculated the mean composite score per abstract (i.e., final score) and the standard deviations (SDs) of the composite scores for a given abstract. These are within-abstract statistics, reflecting the distribution of composite scores across reviews within each abstract. For each within-abstract statistic, we took a weighted mean of the statistic in the old and new rubric periods, using the number of reviews as the weighting factor. Then, we calculated the old to new ratio of the weighted mean of the statistic. To test the hypotheses that the new rubric would (1) decrease scores (i.e., reduce leniency), (2) increase inter-rater reliability, and (3) cause reviewers to use more of the scoring range across abstracts, we calculated the old to new ratio of (1) weighted mean final scores, (2) weighted mean of within-abstract SDs for composite scores, and (3) across-abstract SDs for final scores, respectively.
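Continuing the hypothetical table above, the per-abstract statistics, their review-count-weighted means per period, and the old to new ratios could be computed roughly as follows; this is a sketch of the described calculations, not the authors' code, and the period labels are assumed:

```python
# Assumed rubric-period label for each review ("old" = 2014-2016, "new" = 2017-2018).
reviews["period"] = ["old", "old", "new", "new"]

# Within-abstract statistics: final score (mean composite), SD of composite
# scores within the abstract, and number of reviews (the weighting factor).
per_abstract = reviews.groupby(["period", "abstract_id"])["composite"].agg(
    final_score="mean", within_sd="std", n_reviews="size"
)

def period_summary(g):
    w = g["n_reviews"]
    return pd.Series({
        # Review-count-weighted means of the within-abstract statistics.
        "wtd_final_score": (g["final_score"] * w).sum() / w.sum(),
        "wtd_within_sd": (g["within_sd"] * w).sum() / w.sum(),
        # Across-abstract SD of final scores.
        "across_sd": g["final_score"].std(),
    })

by_period = per_abstract.groupby(level="period").apply(period_summary)
old_to_new_ratios = by_period.loc["old"] / by_period.loc["new"]
```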

We used approximate permutation to estimate the sampling distribution of old to new ratios under the null hypothesis that the rubric had no effect.3 We used sampling with replacement by drawing 1000 samples of 3523 abstracts from the original sample of 3523 abstracts, randomly allocating 2078 as “old” and 1445 as “new” rubric, based on the original ratio of abstracts. We calculated the old to new ratio for each statistic of interest. If the observed old to new ratio falls outside the range of ratios calculated from the 1000 random samples, the null hypothesis can be rejected.
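A rough sketch of that resampling loop, shown here for the ratio of mean final scores only (the same pattern applies to each statistic of interest); the pooled scores below are simulated placeholders, not the study data:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_old, n_new = 2078, 1445

# Placeholder for the pooled per-abstract final scores from both periods;
# in the actual analysis these come from the per-abstract statistics above.
pooled_final_scores = rng.uniform(1, 7, size=n_old + n_new)

null_ratios = np.empty(1000)
for i in range(1000):
    # Draw 3523 abstracts with replacement from the pooled sample, then
    # allocate 2078 as "old" and 1445 as "new", matching the original split.
    sample = rng.choice(pooled_final_scores, size=n_old + n_new, replace=True)
    null_ratios[i] = sample[:n_old].mean() / sample[n_old:].mean()

# If the observed old to new ratio falls outside the range of these 1000
# null ratios, the null hypothesis of no rubric effect is rejected.
print(null_ratios.min(), null_ratios.max())
```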

RESULTS

During the study period, 3523 abstracts were submitted, 2078 in the old period and 1445 in the new period. The effect of the 2017 rubric on composite scores is shown in Table 1. The weighted mean final scores in new rubric years were significantly lower than those in old rubric years. Weighted mean within-abstract SDs of composite scores similarly show statistically significant decreases in new rubric years. Final score SDs across abstracts indicated no statistically significant change.

Table 1 Effect of Rubric on Composite Scores

DISCUSSION

Our new rubric successfully lowered final scores on scientific abstracts, reflecting a shift away from leniency bias (i.e., tendency toward the upper portion of a scoring range). The rubric also decreased the composite score SDs within abstracts, indicating improvement in inter-rater agreement. The rubric did not lead to more variable scores overall across all abstracts; however, scores did shift toward the lower end of the scoring range, such that fewer abstracts received high scores and more received low scores.

Objective evaluation of abstract submissions ensures the rigor of scientific meeting presentations. Efforts should continue to refine and implement tools to improve abstract scoring and maintain a high-integrity environment for disseminating scientific discovery.