Introduction

When eliciting judgments about an unknown quantity, such as the quality of a written argument, one can prompt participants either to directly score an item (cardinal measurement) or to make a comparative judgment (ordinal measurement). Cardinal measurements have been extensively employed in measuring quality of reasoning and argumentation, usually supported by the use of a rubric (Jonsson & Svingby, 2007; Brookhart & Chen, 2015). However, scoring argument quality is time-consuming and subject to various cognitive biases, leading to low inter-rater reliability (e.g., Wachsmuth et al., 2017; Toledo et al., 2019; Gretz et al., 2020). In contrast, ordinal measurements are faster and less cognitively demanding on human raters, reducing the risk of bias and variance (Toledo et al., 2019; Gleize et al., 2019). However, they force raters to collapse the multiple relevant dimensions on which two written texts often fare differently (for instance, Wachsmuth et al., 2017, found 15 different categories relevant for measuring quality of reasoning) into a coarse binary choice. Moreover, ordinal measurements require significantly more (monotonous) judgments to be made (Bramley et al., 1998), leading Verhavert et al. (2018) to state that “one of the most important methodological questions in CJ [comparative judgments] to date is, how can the efficiency (in number of comparisons) of a CJ assessment be increased without affecting the reliability of the final estimates?” (p. 429).

The aim of the current research was to investigate the criterion validity of forced-choice comparisons of the quality of written arguments with normative solutions and explore strategies for producing more efficient comparisons.

The two studies we report on below were conducted as part of IARPA’s Crowdsourcing Evidence, Argumentation, Thinking and Evaluation (CREATE) program.Footnote 1 The CREATE program aimed to develop tools that help groups of intelligence analysts write better-reasoned reports. Within CREATE, the Smartly-assembled wiki-style argument marshalling (SWARM) projectFootnote 2 (which included AM, MEW, LR, AB, TP, AK, BS, MLD, MS, TvG, and SD) focused on measuring the gains in quality of reasoning brought about by structured writing techniques modeled after the Delphi method, as compared to unstructured methods of collaboration. SWARM constructed a corpus of 279 arguments in support of answers to a wide range of reasoning problems with normatively correct solutions (Study 1, Methods section). We instructed participants to choose the better-reasoned rationale out of pairs of these arguments. Study 1 used an MTurk sample, and Study 2 used an expert sample composed of people with relevant expertise in judging reasoning. Criterion validity assesses whether a measure is positively related to other measures one would expect it to be related to. We investigated the extent to which forced-choice judgments tracked accuracy, team size, and expertise.

We first expected that normatively correct answers would be accompanied by better arguments. Indeed, this is the underlying assumption of deliberating groups as diverse as juries and scientific collaborations. We argue with one another because we expect that “some arguments must be better than others and ‘argument strength’ must have some meaningful connection with truth” (Hahn, 2020), at least when we have all the relevant evidence. Most reasoning tasks included in this study (see Table 1) provided all information required to solve them in their statement. Additionally, participants solved them in groups, allowing them to share hidden and undistributed information and to scrutinize each other’s reasoning, thus improving their prospects of reaching the correct solution.

Table 1 Description of problems included in Study 1

Second, we expected larger teams to produce answers that were more accurate and better reasoned. Group performance usually improves with increasing group size, especially for problems of moderate difficulty that require understanding of verbal, quantitative, or logical conceptual systems (Laughlin et al., 2002, 2006; Woolley et al., 2010; Trouche et al., 2014). For example, Kosinski et al. (2012) showed that the probability of finding solutions to cognitively complex problems was logarithmically related to the number of group member responses, findings that were replicated by Vercammen et al. (2019). Moreover, structuring group interaction (using a Delphi protocol, for instance) has also been shown to further improve the accuracy of group judgments (O’Hagan, 2019) and to counter common cognitive biases. We assembled teams ranging from 5 to 21 members. While we did not mandate a minimum level of participation and observed many idle participants in most teams, we nevertheless expected that, all else being equal, larger teams would have more active members and would produce more accurate answers and better rationales.

Finally, we expected the correlations between objective accuracy and quality of reasoning to be stronger for experts than for novices. Expertise cannot simply be reduced to credentials (Burgman, 2016). It requires intensive training (Ericsson, 2006) and deliberate practice (Ericsson & Lehmann, 1996), and it needs to be elicited in a structured way (Burgman et al., 2011). Our expert sample included individuals with research and teaching expertise in logic and critical thinking who had extensive experience marking student assignments, and we elicited their judgments in a structured way.

Study 1: Assessing forced choice using novice raters

In Study 1, we measured criterion validity by assessing whether accuracy and team size predicted whether a rationale was selected as better reasoned in a forced-choice design. We pre-registered our hypotheses on the Open Science Framework (see https://osf.io/re5ha) using the pre-registration template provided by AsPredicted.org (https://osf.io/m3spx/). We hypothesized that (1) products resulting in more accurate solutions would be associated with rationales that were chosen more often in forced-choice comparisons; and (2) larger teams would produce better-justified rationales than smaller teams.Footnote 3

Participants

MTurk raters (N = 218) completed the Human Intelligence Tasks (HITs)Footnote 4 at a rate of USD 10/hr. Each pair of rationales was evaluated by exactly three raters.

Materials

Rationales were produced by teams in the SWARM project. An email invitation was sent to 4179 members of our research pool (van Gelder et al., 2020), of whom N = 233 consented to participate. They were assigned to teams of varying sizes in two production protocols; in the end we assembled four teams of five people, six teams of 10 people, four teams of 15 people, and four teams of 21 people, split evenly across protocols. Participants were given 48 hours (in February–March 2019) to solve 19 problems (Table 1). Two problems, however, were later removed from the dataset because they had mistakenly been presented to groups twice (Logical reasoning 1 and 2 were the same, as were Raven’s matrices 1 and 2). The final dataset was therefore based on 17 unique problems in 12 different problem categories (Table 1, columns 1 and 2). These were selected to establish a comprehensive sample of the types of collective reasoning tasks that could be completed in a group context. Our item-sampling procedure was guided by prior research that had validated this approach for measuring the general reasoning ability of human groups (see Engel et al., 2014; Riedl et al., 2021; Woolley et al., 2010). These studies drew heavily upon McGrath’s task circumplex, an established group task taxonomy originating from social and organizational psychology, to sample a comprehensive set of tasks based on four qualitatively distinct group processes: generate (create and plan together), choose (analyze and decide together), negotiate (resolve conflicts and competing priorities together), and execute (compete and perform together) (see McGrath, 1984, p. 61). Figure 1 displays an adaptation of McGrath’s group task taxonomy, and Table 1 provides specific connections to the item–quadrant combinations we aimed for; however, we acknowledge that these distinctions are not easily resolved, and overlap from one quadrant to another is inevitable.

Fig. 1

An adaptation of McGrath’s task taxonomy for group tasks (McGrath, 1984, p. 61)

Note that some quadrants are more heavily sampled than others as a matter of convenience and context. For example, given the available time and the asynchronous communication constraints, some of McGrath’s group processes, such as those related to “judgment,” were more easily adapted to the present study context than others. The asymmetrical sampling of McGrath’s task circumplex is also evident in the studies that provide precedent for the approach we demonstrate in the present study (e.g., Engel et al., 2014; Riedl et al., 2021; Woolley et al., 2010).

The first production protocol was a simplified version of the Delphi method, which uses an iterative cycle of idea generation and consensus building among group members. Delphi methods have been shown to markedly improve “group” performance on forecasting tasks (Hemming et al., 2018; Wintle et al., 2023) by mitigating group biases such as anchoring, groupthink, and overconfidence. In the first protocol, participants were required to tackle all problems without being able to communicate or share answers with other team members during the first 24 hours. After the initial 24 hours had elapsed, all individual attempts at solving the problems were shared and the team attempted to reach consensus. In the second protocol, participants were free to solve the problems however they wished and to communicate and share answers with other team members from the outset. Each team submitted a single answer to each problem, though not all teams completed all tasks (and some answers were excluded due to poor quality). For this study we pooled all rationales, irrespective of how they were produced. In total, 279 rationales (M = 162 words, SD = 132 words) were collected.

Procedure

Raters were provided with the following instructions:

A set of complex questions were presented to teams of individuals to solve within 48 hours. Teams were asked to both: 1) Provide the correct answer to each problem, and 2) To provide the background rationale for their answer. In the current HIT, we will 1) Present you with the problems participants were shown, and 2) Ask you to evaluate the reasoning of the answers teams generated. Two pieces of rationale will be presented at the same time: Your task is to decide which team you think justified their answer best by clicking on your preferred rationale.

Raters were then presented with a randomly allocated problem statement (Fig. 2a). Once they had read through the problem statement, raters were presented with two randomly selected rationales corresponding to that problem (Fig. 2b) from our corpus of rationales. Raters then chose the rationale they deemed to be “better justified.” Once the choice was made, they were presented with two more randomly drawn rationales. On average, each rater saw 26.4 pairs of rationales (SD = 31.1), amounting to a total of 1915 comparisons and choices. Raters were not told which production protocol had generated a rationale, the size of the team that produced it, or how accurate it was; in many cases, both items in a pair were equally accurate. Data collection took place in May 2019.

Fig. 2

Example of a problem statement (a) with two rationales presented side-by-side to raters (b)

To assess the relationship between accuracy and the forced-choice measure of quality of reasoning, we first calculated accuracy at the problem level (e.g., Doc_ID_1, GEO_1). Some of the problems in our corpus included multiple questions (see Fig. 2). For each comparison we presented to raters (e.g., team I3’s answer to Bay_1 vs. team I9’s answer to Bay_1), we determined which team had answered more of the problem’s questions correctly (i.e., team I3 or team I9). The team that answered more questions correctly was deemed to have provided the more accurate overall solution to the problem. We then combined this information with the forced-choice ratings to estimate the probability that a rater would choose the rationale corresponding to the more accurate solution. Pairs whose answers were equally accurate were excluded from this analysis.
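To make this scoring step concrete, here is a minimal Python sketch (not the original analysis code; the column names and data are hypothetical) that scores each presented pair and estimates the proportion of choices favouring the more accurate solution:

```python
# Illustrative sketch: estimating the probability that a rater chooses the
# rationale supporting the more accurate solution. Data are hypothetical.
import pandas as pd

# One row per forced-choice trial: the two teams compared, the number of
# questions each answered correctly for that problem, and the rater's choice.
trials = pd.DataFrame({
    "problem":     ["Bay_1", "Bay_1", "GEO_1"],
    "team_a":      ["I3", "I3", "I5"],
    "team_b":      ["I9", "I12", "I7"],
    "correct_a":   [2, 1, 3],   # questions answered correctly by team A
    "correct_b":   [1, 1, 1],   # questions answered correctly by team B
    "chosen_team": ["I3", "I12", "I7"],
})

# Drop ties: pairs whose solutions were equally accurate are uninformative here.
untied = trials[trials.correct_a != trials.correct_b].copy()

# Identify the more accurate team for each remaining pair.
untied["more_accurate"] = untied.apply(
    lambda r: r.team_a if r.correct_a > r.correct_b else r.team_b, axis=1
)

# Proportion of choices that favoured the more accurate solution.
p_accurate = (untied.chosen_team == untied.more_accurate).mean()
print(f"P(choose more accurate rationale) = {p_accurate:.3f}")
```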

Results

  1. Accuracy. Novices chose the rationale supporting the more accurate solution 62.2% of the time (SE = 1%). See Table 3 for further details.

  2. Comparison between team sizes. Larger teams produced rationales that were more likely to be chosen than those produced by teams with fewer members (Table 2). For instance, the probability of MTurk participants choosing a rationale produced by a team of 21 (column) over one produced by a team of 5 (row) was .82 (SD = .02). This corresponds to an effect size of 1.29 (SD = .22; see row 21, column 5 in the MTurk panel of Table 2).

  3. Inter-rater reliability. The percent agreement between raters was 70.58% (95% CI = 1.18). Chance agreement is 50%, so performance was significantly and substantially better than chance, although far from perfect.

  4. Response time. While raters must read products upon first presentation, most comparisons were between pairs of products that raters had previously read, and judgments were made quite rapidly. The median response time per comparison was ~9 seconds (mean = 29.9 seconds, SD = 100.07 seconds). Median response times per problem are reported in Table 3.

Table 2 Bayesian probability estimates of choosing products created by the team with higher numbers of allocated members, by MTurk and expert raters. Below the diagonal line are mean probabilities (and SD); above the diagonal line are effect sizes (and SD). Responses by MTurk raters and expert raters are the left and right halves of the table, respectively
Table 3 Descriptive statistics by problem for average proportion correct, median response time in seconds, and percentage of forced-choice responses that reflect proximity to the correct answer for MTurk and expert raters

Discussion

Determining quality of reasoning is inherently subjective and context-dependent (Woods, 2013). Even when provided with detailed guidance, human raters tend to produce judgments with low reliability (e.g., Wachsmuth et al., 2017). Study 1 establishes that a forced-choice design can be used to evaluate quality of reasoning. Prompting novice raters to make comparative assessments of reasoning between similar products tends to facilitate valid, reliable, and efficient judgments that align with various dimensions of accuracy. This finding confirms our pre-registered hypothesis that more accurate solutions would tend to be associated with the chosen rationale in a forced-choice comparison.

A written rationale supporting a more accurate solution was significantly more likely to be chosen over a less accurate one, and this trend was relatively strong even among individual raters with no prior training and only minimal guidance. Furthermore, these trends were observed across a wide range of problems with different kinds of reasoning and different levels of difficulty. Indeed, while the proportion of correct answers to the Bay_1 problem was only .07, raters nevertheless selected the Bay_1 rationale supporting a more accurate solution in 52% of cases. For Doc_ID the proportion of correct answers was .17, but raters achieved 74% accuracy (Table 3).

Second, we expected that larger teams would outperform smaller ones. This was reflected in our second hypothesis, which was supported by the results: novices consistently selected the reports generated by larger teams as being better reasoned, amounting to substantial effects (Table 2).

Finally, our secondary analysis found that raters made relatively accurate forced-choice comparisons in a brief amount of time. The median response time was ~9 seconds for MTurk participants; note that this is not apparent from the mean, because the distribution was strongly right-skewed by the initial reading of the products, which typically took participants considerably longer than 9 seconds.

Study 2: Assessing forced choice using expert raters

In Study 2, we investigated the performance of expert raters with no training and no calibration.

Participants

“Expert” raters (N = 6) were selected on the following criteria: (1) having completed, or currently completing, a postgraduate degree in logic or the psychology of reasoning, and (2) having teaching experience (including grading coursework) in logic. We recruited five postdoctoral fellows and one advanced PhD student. On average, the experts had published 4.83 peer-reviewed articles (SD = 5.27) and had taught 13.16 undergraduate courses (SD = 7), 4.66 of which were in logic (SD = 4.36). Raters were compensated at approximately AUD 40/hr. Each pair of rationales was evaluated by two expert raters.

Methods

The materials, procedure, and measures were as in Study 1, with one exception: we investigated only a subsample of nine problems that aligned most closely with the raters’ area of expertise (logic and analytic reasoning) and with the IARPA-CREATE program goals, sampling one problem from each relevant category (Table 3). Data collection took place in May–June 2019.

Results

  1. Accuracy. Experts chose the rationale supporting the more accurate solution 74.4% of the time (SE = 1%). See Table 3 for further details.

  2. Comparison between team sizes. As in Study 1, larger teams produced rationales that were more likely to be chosen (Table 2); that is, under the forced-choice methodology, experts were more likely to select the rationales generated by larger teams. For instance, the probability of experts choosing a rationale produced by a team of 10 (column) over one produced by a team of five (row) was .71 (SD = .02). This corresponds to an effect size of .78 (SD = .17; see row 10, column 5 in the Expert panel of Table 2).

  3. Inter-rater reliability. The percent agreement between raters was 80.98% (95% CI = 2.26). As with the novice raters, reliability was significantly above chance.

  4. Response time. Comparisons took slightly longer for experts than for MTurk raters (grand median response times of ~14.4 vs. ~9 seconds, respectively); nevertheless, experts still made their comparisons quickly (mean = 26.77 seconds, SD = 40.69 seconds). Median response times broken down by problem are presented in Table 3.

Discussion

Experts chose the product that corresponded to the more accurate answer substantially more often than novices did (74.4% compared to 62.2%). One notable exception was the Bayes network problem (Bay_1), for which accurate solutions were scarce (proportion correct = .07) compared to other problem types. This not only reduced the number of accurate written products among which quality of reasoning could be assessed but may also have undermined the expert raters’ capacity to clearly discriminate better- from worse-reasoned rationales.

The percent agreement is substantially better for experts than for novices (the percent agreement for MTurkers was only 70.58%), as one would expect—adding to the case for the criterion validity of the procedure. However, the difference is perhaps not as large as one might have expected. Agreement depends on the consistency with which raters are addressing the same construct, but also on the discriminability of the choices. It may be that many of our products were not particularly discriminable and that participants were forced to guess. While we adhered to a strict two-alternative choice protocol (as did Toledo et al., 2019, and Gleize et al., 2019), we recognize that other methods, such as those employed by Habernal and Gurevych (2016), gave raters the additional option to say that rationales were equally convincing (e.g., three-alternative methods: better, worse, or equally well reasoned). We suspect such an undertaking would have greatly increased reliability; however, more alternatives may have also led to a trade-off between reliability and efficiency.

In the present study, we observed that experts took longer to make a decision but still made decisions relatively quickly.

We explore the cues participants used to identify better-reasoned rationales in the next section.

What makes a rationale better reasoned?

In this section we explore the linguistic features of rationales in support of more accurate answers and the differences between rationales selected as better reasoned by experts and MTurkers in the two studies reported above.

We focus on the differences observed between experts and MTurkers in terms of performance, and hence restrict the analysis to the common rationales associated with the nine problems that were rated by both groups. In order to obtain the same number of ratings per rationale, we randomly selected two of the three ratings that the MTurkers made per pair (recall that each pair was rated by three MTurkers, but only two experts). This sub-corpus includes 148 different rationales and 1147 distinct pairwise choices, each rated twice per group, yielding a total of 2294 ratings per group. To calculate the accuracy of each team’s answers and to make scores comparable across problems for the analyses in this section and the next, the problem-specific accuracy scores were normalized within problem type by subtracting the problem mean and dividing by the problem standard deviation. The normalized scores were oriented such that lower scores implied greater accuracy. As stated previously, the corpus contains many rationales tied on the accuracy of the solutions they support, and these were excluded from this analysis. Calculated in this way, experts selected the more accurate rationale in this sub-corpus 76% of the time, whereas the MTurkers did so 66% of the time.
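A minimal sketch of the within-problem normalization follows; the column names are hypothetical, and negating the z-score is simply one way to obtain the stated orientation (lower score = greater accuracy), not necessarily how the original analysis arrived at it:

```python
# Illustrative sketch: z-score accuracy within each problem type, then flip
# the sign so that lower normalized scores indicate greater accuracy.
import pandas as pd

rationales = pd.DataFrame({
    "problem":  ["Bay_1", "Bay_1", "GEO_1", "GEO_1"],
    "team":     ["I3", "I9", "I5", "I7"],
    "accuracy": [2, 0, 3, 1],   # raw accuracy of the supported solution
})

grouped = rationales.groupby("problem")["accuracy"]
rationales["normalized_score"] = -(
    (rationales["accuracy"] - grouped.transform("mean"))
    / grouped.transform("std")
)
print(rationales)
```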

To explore the linguistic features of rationales, we used three different sets of metrics. The first is the set of indicators produced by the Linguistic Inquiry and Word Count (LIWC-22; see Pennebaker et al., 2015)Footnote 5 software, the newest version of one of the most popular psycholinguistic tools. LIWC scores each rationale on over 100 dimensions or categories covering cognitive, temporal, emotional, grammatical, and other aspects of language. Each rationale is scored on each dimension by searching for terms or tokens, with higher scores indicating a greater presence of words or tokens from that category.

The second set of metrics includes the integrative complexity (IC) suite.Footnote 6 IC has been operationalized on a seven-point scale for assessing awareness of alternative perspectives and for connecting perspectives to reach integrative conclusions (see Conway et al., 2020; Conway et al., 2014; Suedfeld & Tetlock, 2014). In addition to general IC, there are two alternative dimensions of IC that are captured by the tool: dialectical complexity, which involves grappling with the cognitive tensions between competing perspectives (more usage of “however”), and elaborative or cognitive complexity, which involves reducing tensions by generating reinforcing reasons for taking strong stands (more usage of “in addition”; see discussion in Conway et al., 2008).

The last linguistic metric is the comparison class (CC) metric developed by Karvetski et al. (2021). This metric is a crossover from the world of geopolitical forecasting; the model assigns high scores to rationales featuring terms that look to past precedents (e.g., “last,” “past”), blend past data (e.g., “average”), or make relative comparisons (“than,” as in “more/less than”).

For the analyses below, pairwise choices were transformed into winning percentages (see the sketch following Fig. 3). For example, if a rationale won 7 of 10 pairwise matchups across the raters, its winning percentage would be 70%. Figure 3 shows the correlations of winning percentage from all ratings (either “expert_winpct” or “turker_winpct”) with accuracy (“avg_normalized_score”). Also included in Fig. 3 are team size (“team_size”) and any variable from the set of 121 linguistic variables that had a correlation of r ≥ .3 with these four variables. We see that word count, the three IC variables (“IC,” “DIAL” for dialectical complexity, and “ELAB” for elaborative complexity), the comparison class variable (“prediction_CC”), and the use of first-person pronouns (“LIWC_ipron”) implied better scores, given that normalized accuracy is negatively oriented. The strongest accuracy correlate was the expert-derived winning percentage, followed by team size and the MTurker-derived winning percentage; these are medium effect sizes. The MTurkers outperformed word count and the comparison class metric, which were the strongest correlates among the linguistic variables.

Fig. 3

Correlations of linguistic variables and the winning percentage variables with accuracy. All rationales were analyzed using Linguistic Inquiry and Word Count software, a common psycholinguistic analysis package that allows researchers to examine the relationship between hundreds of variables across a diverse range of written texts (LIWC-22, see https://www.liwc.app/); LIWC_WC = Basic Word Count, including content-related words (e.g., adjectives, nouns, verbs), function words (e.g., prepositions, conjunctions), symbols such as “p” for probability, numerical expressions such as those used in mathematical formulae, and utterances such as mmm, uh-huh; LIWC_ipron = First Person Pronouns, the frequency with which the written rationale contains first-person pronouns; IC = Integrative Complexity, which assesses awareness of alternative perspectives and connecting perspectives to reach integrative conclusions; DIAL = Dialectical Complexity, text demonstrative of opposing views and of resolving conflict and contradictions (e.g., “however,” “on the other hand”); ELAB = Elaboration, text that strengthens an argument or extends reasoning using detailed explanations with words such as “moreover” or “additionally”; team_size = the total number of participants in the team that produced the written rationale; expert_winpct = winning percentage from all paired ratings of rationales by the expert raters; turker_winpct = winning percentage from all paired ratings of rationales by the MTurker (non-expert) raters; avg_normalized_score = the normalized scores for the objective accuracy of problems solved by the groups in relation to their written rationales, noting that all normalization occurred within each of the nine different problem types (within-class normalization) by subtracting the mean rationale score and dividing by the standard deviation, with lower scores implying greater accuracy; prediction_CC = Comparison Class, a metric that increases with the proportion of text that looks to past precedents to extend arguments with words like “last” or “previous,” blends in past data with words like “average,” and uses comparative language such as “more than” or “less than.”
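As an illustration of the winning-percentage transformation and the r ≥ .3 screening described before Fig. 3, here is a small Python sketch with hypothetical data and column names (not the original analysis code):

```python
# Illustrative sketch: convert pairwise choices into winning percentages and
# screen variables by their correlation with winning percentage.
import pandas as pd

# choices: one row per rating, with the winning and losing rationale IDs.
choices = pd.DataFrame({
    "winner": ["r1", "r1", "r2", "r3"],
    "loser":  ["r2", "r3", "r3", "r1"],
})

wins = choices["winner"].value_counts()
appearances = pd.concat([choices["winner"], choices["loser"]]).value_counts()
win_pct = (wins.reindex(appearances.index, fill_value=0) / appearances).rename("winpct")

# features: one row per rationale with linguistic metrics and normalized accuracy.
features = pd.DataFrame(
    {"LIWC_WC": [180, 95, 60], "IC": [2.1, 1.4, 1.0],
     "avg_normalized_score": [-0.8, 0.3, 0.5]},
    index=["r1", "r2", "r3"],
)
data = features.join(win_pct)

# Keep variables whose correlation with winning percentage is at least .3.
corr = data.corr()["winpct"].drop("winpct")
print(corr[corr.abs() >= 0.3])
```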

The two studies described above, and the further exploratory analysis presented in this section, establish that forced-choice assessments of quality of reasoning have high criterion validity and reasonable inter-rater reliability. The results provided by forced choice are consistent between expert and non-expert raters, the protocol itself requires little to no training, and the decisions between products can be completed within short time limits (i.e., within a minute). Our results also show that the method is applicable to longer written products than those investigated so far (50–500 words in this study compared to, e.g., 8–36 words in Toledo et al., 2019). Moreover, the method is context-neutral and could be adapted to evaluate arguments about a variety of topics reliably, especially if supported by additional training for raters (which we did not provide in our studies). However, measuring quality of reasoning through a forced-choice design requires a substantial number of monotonous assessments. In the following two sections we explore strategies for building a more efficient system.

Efficiency of forced-choice evaluations: Automating assessments

In this section we investigate the accuracy of automated assessments of quality of reasoning performed by a LASSO (least absolute shrinkage and selection operator) regression model trained on a subset of our corpus (described in the previous section).

Model

Building on the 121 text-based metrics described in the previous section, we can represent each of the 148 rationales as a unique 121-dimensional vector (with the ith rationale represented as Ri). To automate quality of reasoning assessments, we selected LASSO regression (see Hastie et al., 2017, for a complete description of the model). The LASSO is an ordinary regression model augmented with a single parameter (lambda) that penalizes the sum of the absolute magnitudes of the coefficients (the L1 penalty). As lambda increases, the coefficients “shrink” in magnitude, and some coefficients are zeroed out. This procedure has been shown to reduce overfitting and, importantly, leaves the modeler with a subset of nonzero coefficients (i.e., the model performs variable selection). The optimal lambda is found within the LASSO’s internal k-fold cross-validation routine using training and hold-out data. The value of lambda that minimizes average error on the hold-out folds (known as lambda.min) is then used to make out-of-sample predictions and to derive the corresponding coefficients for each variable.
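The lambda.min terminology suggests an R-style cross-validated LASSO (e.g., glmnet); the following sketch shows a roughly equivalent setup in Python with scikit-learn’s LassoCV, using simulated data of the same shape (148 rationales × 121 metrics) rather than our corpus:

```python
# Rough scikit-learn analogue of a cross-validated LASSO; data are simulated
# placeholders, not the rationale corpus described in the text.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(148, 121))   # 148 rationales x 121 text-based metrics
y = rng.uniform(size=148)         # winning percentages in [0, 1]

# LassoCV picks the penalty minimizing mean CV error (akin to lambda.min);
# standardizing first makes the L1 penalty comparable across metrics.
model = make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
nonzero = np.flatnonzero(lasso.coef_)
print(f"selected penalty (alpha): {lasso.alpha_:.4f}")
print(f"nonzero coefficients: {len(nonzero)} of {X.shape[1]}")
```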

Data and sampling

As in the previous section, we converted the pairwise choices to a continuous outcome variable by calculating the “winning percentage” (how often a rationale wins its pairwise faceoffs) of each rationale within a problem set. In this way, every rationale received a winning proportion ranging from 0 (worst) to 1 (best). With these proportions as the dependent variable, the predictors were the 121-dimensional profile vectors.

We performed the modeling exercise twice (a sketch of the two sampling regimes is given below). First, we randomly selected a subset of the 148 rationales across the different problems as training data, with the remainder as testing data. Under this regime, a problem could contribute rationales to both the training and the testing sets, so it tests whether quality of reasoning learned within a domain generalizes to new rationales from the same domains. Second, we selected all rationales for a random subset of problems as training data, with the rationales for the remaining problems as testing data, to examine transfer of reasoning; here a problem’s rationales were either all in the training set or all in the testing set. For both exercises we tracked the number of nonzero LASSO coefficients selected, the correlation between the model-predicted win proportion and the rater-derived win proportion on the testing data, and the correlation of the model-predicted win proportion with normalized accuracy, again on the testing data. We varied the testing versus training sample sizes to show convergence as more data were included in the training sample, and we ran the routine 100 times for each sampling size, averaging the performance metrics to reduce simulation noise.
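The two sampling regimes can be sketched as follows; the data layout (9 problems, 148 rationales) and function names are illustrative, not the original code:

```python
# Illustrative sketch of the two train/test regimes: (a) a random split of
# rationales, so a problem can appear on both sides, and (b) a split by
# whole problems (transfer of reasoning).
import numpy as np

rng = np.random.default_rng(0)
problems = np.repeat(np.arange(9), 17)[:148]   # hypothetical problem ID per rationale

def split_by_rationale(n_train):
    """Regime (a): a problem may contribute rationales to train and test."""
    idx = rng.permutation(len(problems))
    return idx[:n_train], idx[n_train:]

def split_by_problem(n_train_problems):
    """Regime (b): all of a problem's rationales land on one side only."""
    train_p = rng.choice(np.unique(problems), size=n_train_problems, replace=False)
    mask = np.isin(problems, train_p)
    return np.flatnonzero(mask), np.flatnonzero(~mask)

# In the paper, each split size is redrawn 100 times and the performance
# metrics (nonzero coefficients, test correlations) are averaged.
train_a, test_a = split_by_rationale(n_train=100)
train_b, test_b = split_by_problem(n_train_problems=6)
print(len(train_a), len(test_a), len(train_b), len(test_b))
```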

Results

We first randomly sampled from the 148 rationales. In Fig. 4, the top panel shows the number of nonzero LASSO coefficients. The second panel shows the correlation of model-predicted winning percentage (when trained on the training data) with the realized winning percentage in the testing data; these are large effects, r > .5. The correlation drops off at the end, likely because the testing set shrank as the training data grew, so winning percentages were calculated over fewer choices and became less stable. The third panel shows the correlation with normalized accuracy (of the testing data) when a model was trained on the expert-derived winning percentages of the training data (green line), when it was trained on the MTurker-derived winning percentages, and when it was trained directly on the normalized accuracy scores of the training data. The green dashed line shows the correlation benchmark of the overall expert-derived winning percentages with normalized accuracy (r = .41; see Fig. 3), the blue dashed line shows the same for the MTurkers (r = .34), and the black dashed line shows the correlation of word count with normalized accuracy (r = .27).

Fig. 4

Results of training a LASSO model on mutually exclusive ratings. In panel 3, the black dashed line shows the correlation of word count with normalized accuracy. The green and blue lines show the correlation of normalized accuracy with the rankings derived from expert and MTurker assessments, respectively

We see that as data accumulated in the training set, the accuracy-trained model correlated best with normative accuracy, and there was little separation between the models trained on expert- and MTurker-derived winning percentages. Furthermore, the fact that the models achieve a higher correlation with accuracy than word count alone suggests that they are tracking some substantive dimension of reasoning quality. Table 4 shows the nonzero LASSO coefficients associated with the expert-derived winning percentages using all 148 rationales, Table 5 shows the coefficients associated with the MTurker-derived winning percentages, and Table 6 shows the nonzero LASSO coefficients obtained with normative accuracy as the dependent variable. All models selected the comparison class (prediction_CC) variable and all zeroed out word count. While these two variables are highly correlated (r = .72), the comparison class metric is the preferred predictor of accuracy when used alongside the other linguistic variables. The LASSO model trained on expert-derived winning percentages featured a mix of cognitive/reasoning-style variables (e.g., comparison class, elaboration) and generic emotional (e.g., negative emotion) and grammatical (e.g., first-person pronouns) terms, whereas the LASSO model trained on MTurker-derived winning percentages was the simplest, featuring only comparison class and elaboration. The accuracy-trained LASSO model featured a mix of all variable types, including some that may be case-specific, such as money and ethnicity. Team size correlated moderately and positively with normalized accuracy on the problem sets (r = .35). There was also a moderate correlation between team size and word count (r = .33); however, this did not translate directly into better quality of reasoning according to the ratings, as evidenced by the much weaker correlations between team size and quality of written reasoning as judged by experts (r = .16) and MTurkers (r = .18). Our dataset therefore does not support the idea that larger groups produce more accurate and better-reasoned arguments simply because they have a greater capacity to generate higher word counts.

Table 4 LASSO coefficients for expert-derived win percentage
Table 5 LASSO coefficients for MTurker-derived win percentage
Table 6 LASSO coefficients for model trained on normative accuracy

We then investigated the possibility of transfer of reasoning. The key difference from the previous setup is that the training data were formed by first selecting a subset of the nine problems and then training on all rationales from those problems, so there was no crossover of rationales from the same problem between training and testing data. In Fig. 5, panel 1 shows a similar number of nonzero coefficients as above, but panel 2 shows that the correlation between the LASSO-predicted winning proportion and the proportion derived from the experts’ choices was lower than the comparable correlation for the MTurkers. In other words, the model-assessed reasoning did not transfer to other problems as well for the experts as for the MTurkers. Lastly, panel 3 shows that the models struggled to achieve a correlation with accuracy comparable to that of word count. Accuracy-based training was the worst of the three models, likely because it included more case-specific variables (see Table 6), whereas the MTurker-derived model transferred best to other problems, presumably because it was the simplest model, featuring only two generic analytic/reasoning variables. Overall, this modeling experiment underscores the challenge of transfer of reasoning.

Fig. 5

Results of training a LASSO model on mutually exclusive problems. In panel 3, the black dashed line shows the correlation of word count with normalized accuracy. The green and blue lines show the correlation of normalized accuracy with the rankings derived from expert and MTurker assessments, respectively

Efficiency of forced-choice evaluations: Exploiting transitivity through AVL trees

In this section we explore an alternative way of reducing the number of comparisons needed to generate forced-choice assessments, one that leverages the transitivity of our participants’ judgments.

In the studies presented above, and in the previous studies that have used the forced-choice procedure to assess quality of reasoning (e.g., Toledo et al., 2019), all possible combinations of products were compared. For a corpus of size n, this means that one must elicit n(n − 1)/2 judgments. If one wishes to increase the reliability of each ranking, one may require multiple judgments per comparison, which further increases the number of judgments that must be collected. To avoid significant cost, the number of products must be restricted, which is not ideal when one wishes to train reliable classifiers.

However, increased efficiency could be achieved by taking advantage of transitivity. It seems plausible that if product A is rated as better than product B, and product B is rated as better than product C, then one could infer that product A is better than product C without explicitly making that comparison. In our corpus of written rationales, 73.9% of triplets in the MTurk condition and 90.35% of triplets in the expert condition satisfied transitivity (a sketch of this check is given below). Toledo et al. (2019) found that transitivity held for 96.2% of all argument triplets in their corpus for which all pairwise combinations were annotated, and similar results were reported by Gleize et al. (2019), who found that 99% of their triplets satisfied transitivity. While the percentage of transitive triplets is lower in our dataset, we believe the discrepancy is explained by our use of longer arguments and by the lack of expertise in the novice (MTurk) condition. Nonetheless, transitivity generally holds and can be exploited when choosing which pairs to present to raters, thereby improving the efficiency of data collection.
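A triplet counts as transitive when its majority pairwise preferences contain no cycle; a small Python sketch of that check (with hypothetical preferences, not our corpus) is:

```python
# Illustrative sketch: share of transitive triplets among majority preferences.
from itertools import combinations, permutations

# prefers[(a, b)] is True if rationale a beat rationale b on the majority vote.
prefers = {("A", "B"): True, ("B", "A"): False,
           ("B", "C"): True, ("C", "B"): False,
           ("A", "C"): False, ("C", "A"): True}   # an intransitive cycle

items = {x for pair in prefers for x in pair}

def is_transitive(a, b, c):
    """A triplet is transitive unless it forms a preference cycle."""
    for x, y, z in permutations((a, b, c)):
        if prefers.get((x, y)) and prefers.get((y, z)) and prefers.get((z, x)):
            return False
    return True

triplets = list(combinations(sorted(items), 3))
share = sum(is_transitive(*t) for t in triplets) / len(triplets)
print(f"transitive triplets: {share:.1%}")
```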

The approach that we chose to test is to add products to an AVL tree (Adelson-Velsky & Landis, 1962). An AVL tree is a self-balancing binary search tree that can be used to create a complete ordering of a set of items. The AVL tree requires only pairwise order comparisons and takes advantage of transitivity, so that inserting an item requires O(log n) comparisons (where n is the total number of items to be inserted). Typically, these comparisons would be performed within a program, but in our case human raters make them. Comparing every product against every other product takes O(n²) comparisons; using the AVL tree this can be reduced to O(n log n) comparisons, which for large datasets is a significant saving. A sketch of the procedure is given below.
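A minimal sketch of the idea, not our original implementation: items are inserted into an AVL tree, every left/right decision is delegated to a comparator that in practice would be a human forced-choice judgment, and the in-order traversal of the finished tree is the recovered ranking.

```python
# Minimal AVL-tree sketch: the comparator stands in for a human rater.
class Node:
    def __init__(self, item):
        self.item, self.left, self.right, self.height = item, None, None, 1

def height(n):
    return n.height if n else 0

def update(n):
    n.height = 1 + max(height(n.left), height(n.right))

def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    update(y); update(x)
    return x

def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    update(x); update(y)
    return y

def insert(node, item, better_than, counter):
    """Insert `item`; `better_than(a, b)` stands in for a human judgment."""
    if node is None:
        return Node(item)
    counter[0] += 1                       # one pairwise comparison elicited
    if better_than(item, node.item):      # better-reasoned items go to the right
        node.right = insert(node.right, item, better_than, counter)
    else:
        node.left = insert(node.left, item, better_than, counter)
    update(node)
    balance = height(node.left) - height(node.right)
    if balance > 1:                       # left-heavy
        if height(node.left.left) < height(node.left.right):
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance < -1:                      # right-heavy
        if height(node.right.right) < height(node.right.left):
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node

def ranked_order(node, out):
    """In-order traversal yields the worst-to-best ranking."""
    if node:
        ranked_order(node.left, out)
        out.append(node.item)
        ranked_order(node.right, out)
    return out

# Toy usage: numeric "quality" stands in for raters' judgments; sorted input
# is the worst case for an unbalanced tree and exercises the rebalancing.
items, root, comparisons = list(range(1, 21)), None, [0]
for it in items:
    root = insert(root, it, lambda a, b: a > b, comparisons)
print(ranked_order(root, []))                          # worst-to-best ordering
print("AVL comparisons:", comparisons[0],
      "vs all pairs:", len(items) * (len(items) - 1) // 2)
```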

The AVL tree assumes transitivity. Empirically, however, this is only partially true, and so the use of an AVL tree in this way will create some distortion of the ordering of the products. To assess the loss of reliability, we calculated Spearman’s rank correlations between a reference ranking and the ranking produced in each condition. For the MTurk conditions, the reference ranking was created using all three MTurk raters and all comparisons. For the expert conditions, the reference ranking was created using the two expert raters and all comparisons.
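Assessing the distortion then reduces to a rank correlation between the reference ordering and the AVL-derived ordering; a small sketch with hypothetical rankings (not our data):

```python
# Illustrative sketch: Spearman rank correlation between a reference ranking
# and a ranking recovered via the AVL-tree procedure.
from scipy.stats import spearmanr

reference_rank = [1, 2, 3, 4, 5, 6, 7, 8]   # all comparisons, all raters
avl_rank       = [1, 3, 2, 4, 5, 7, 6, 8]   # hypothetical AVL-based ranking

rho, p = spearmanr(reference_rank, avl_rank)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```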

Table 7 shows the performance using either MTurk or expert ratings, for each problem and averaged across problems. On the left-hand side, we show the correlations for the MTurk raters using all comparisons but only one rater, and then for AVL trees using either three raters or one rater. The correlation using a single rater and all comparisons remains high. With the AVL trees, using three raters yields good performance, but using only one rater introduces substantial distortion relative to the reference ordering. A similar pattern emerges for the expert raters: using one rater with all comparisons, or an AVL tree with two raters, maintains good performance, but using a single rater with the AVL tree decreases the reliability of the ranking substantially.

Table 7 Correlations between the majority choice for (three) MTurk and (two) expert raters based on assessments of all pairwise comparisons and: the choice of a randomly chosen single rater (all comparisons/one rater) for all comparisons; the majority choice of three MTurk and two expert raters, respectively, when using the AVL tree approach (AVL/three raters); and the choice of a randomly chosen single MTurker and expert, respectively, when using the AVL tree approach (AVL/one rater)

Using one rater with all comparisons, or AVL trees with three novice raters, produces approximately equivalent results. However, the number of comparisons required is quite different, particularly as the number of products to be rated increases. For instance, with 100 products, performing all comparisons once requires 4950 ratings, while using the AVL tree requires about 1993 ratings. With 1000 products, collecting all comparisons requires 499,500 ratings, while the AVL tree approach requires only about 29,897 ratings (a back-of-the-envelope calculation is sketched below). The savings in ratings required can therefore be substantial.
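The quoted figures can be reproduced by assuming one rating per pair for the exhaustive design and roughly 3 · n · log2(n) ratings for the AVL design (three raters, about log2(n) comparisons per insertion); this reading is our assumption rather than a statement from the text:

```python
# Back-of-the-envelope reconstruction of the comparison counts quoted above;
# the "three raters x log2(n) per insertion" assumption is ours.
from math import log2

def all_pairs_ratings(n, raters=1):
    return raters * n * (n - 1) // 2

def avl_ratings(n, raters=3):
    return round(raters * n * log2(n))

for n in (100, 1000):
    print(n, all_pairs_ratings(n), avl_ratings(n))
# -> 100: 4950 vs 1993; 1000: 499500 vs 29897
```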

Conclusion

Assessing quality of reasoning is challenging. Most prior work has relied on rating scales, which are compromised by both inter- and intra-rater variability. In this paper, we tested a forced-choice procedure that mitigates these problems. To establish criterion validity, we showed that endorsement in a forced-choice judgment is associated with rationales that support more accurate answers and that were produced by larger teams, and that this association is stronger for raters with greater relevant expertise. We also explored two methods for reducing the burden of generating large numbers of pairwise comparisons. The first involved training a regression model to predict scores from automatically derived linguistic features; while this method works well within a domain, more work is required to understand the conditions under which it can be accurately applied across problem domains. Second, we found that the intelligent selection of comparisons to present to raters using AVL trees can substantially decrease the number of judgments required while maintaining high accuracy. Coupled with the remarkable speed with which raters can make judgments, these results suggest that forced choice is a valid, reliable, and efficient method for measuring quality of reasoning in written arguments.