Abstract
In multiple-choice tests, guessing is a source of test error which can be suppressed if its expected score is made negative by either penalizing wrong answers or rewarding expressions of partial knowledge. Starting from the most general formulation of the necessary and sufficient scoring conditions for guessing to lead to an expected loss beyond the test-taker’s knowledge, we formulate a class of optimal scoring functions, including the proposal by Zapechelnyuk (Econ. Lett. 132, 24–27 (2015)) as a special case. We then consider an arbitrary multiple-choice test taken by a rational test-taker whose knowledge of a test item is defined by the fraction of the answer options which can be ruled out. For this model, we study the statistical properties of the obtained score both for standard marking (where guessing is not penalized) and for marking where guessing is suppressed, either by steep score penalties for incorrect answers or by marking schemes that reward partial knowledge.
1 Introduction
The multiple-choice test item is objective in the sense that the correct response is unambiguously defined; but since the correct answer is “hidden in plain view”, the construction of this item is more complicated than that of the so-called “open question” or constructed-response item, where the test-taker produces her response unaided rather than selects one among preconstructed ones. However, since the extra complexity in item construction relative to the constructed-response item is independent of the number of test-takers, the ease of scoring and administering multiple-choice items leads to great savings when the number of test-takers is large.
By the “multiple-choice item”, we restrict ourselves to the special case of “selected-response items” where the test-taker knows that out of the many options presented only one is correct (the incorrect options are known as “distractors” and the correct one as the “key”). This in itself is not sufficient to define the item uniquely, as there are many different ways of assigning scores among all possible responses—for instance, some distractors might be more “wrong” than others, or the response may consist of two options rather than one selected, and so on. The prototypical scoring function is to assign credit only for the correct response, defined as the test-taker’s correct and unambiguous designation of the correct answer option. However, if the test-taker is rational and seeks to maximize her expected score, scores under this scoring function suffer from confounding variance due to guessing [1, 2, 46].
Whereas computerized forms of multiple-choice testing where the answer options are presented one-at-a-time may reduce the test-taker’s propensity for guessing [33], guessing remains a rational strategy and this solution is in any case impossible to implement in pencil-and-paper forms; this applies also to the variation where the test-taker is allowed “answers until correct” [42]. A direct solution to the problem is merely to increase the test length with more test items, but since item construction is cumbersome, it is preferable to solve the problem by changing the scoring function.
The most straightforward solution is then to subtract points for incorrect answers, a procedure sometimes called “formula scoring” [27, 9]. This provides strong incentive against guessing but may entail a perceived problem of “fairness” in the test-taker population. Additional variance due to differences in risk-taking behavior is also introduced [15], and an “intimidation factor”, detrimental to the test validity, associated with this type of scoring has been reported [7, 5]. However, this latter effect can be psychologically mitigated, if not eliminated, by providing points for skipped questions instead of subtracting points for incorrect answers [39, 34].
It has also been suggested [13, 3, 1, 11, 8, 44, 32, 20, 18, 22, 37, among others] that the space of possible responses to a multiple-choice test item can be scored in a much more nuanced way by assigning “partial credit” to test-takers who indicate correctly that they know some options are wrong, rather than hazard a guess on the right answer. The added complexity of choice does not seem to pose any significant problems for the test-takers [4]. Besides the greater discriminatory power that the test is supposed to achieve like this by extending the effective scoring range, it is also an instrument to penalize blind guessing by rewarding the expression of “partial knowledge”. The influence of such “partial knowledge” on the solution of the multiple-choice test item is apparent, for instance, in the study by Medawela et al. [30] on dentistry students where parallel forms of multiple-choice and fill-in-the-blanks tests were used.
A non-technical review of a small subset of the above-cited scoring functions is presented by Lesage et al. [26] and, in a more technical sense, computer simulations aiming to establish the effect of different scoring functions on the test reliability are reported by Chica and Tárrago [12] and Frary [21]. The fundamental problem is the arbitrariness inherent in how to award partial credit for a response. While some authors derive scoring functions from different axioms [44, 20], this approach ignores the question of the actual validity of the test and it may lead to unintended results, no matter how simple or appealing the underlying axioms. In general, corroborating tests and their scoring functions against real-world performance is a time-consuming and expensive endeavor. For instance, the results of Vanderoost et al. [41] are limited to the particular partial scoring model – the one due to Arnold and Arnold [3] – evaluated in their study.
In regard to the above-mentioned considerations, we present a formalization of the probability of solution of a multiple-choice test item as a function of the fraction of the material that the test-taker is presumed to know, and use it to investigate theoretically the characteristics of a number of different scoring functions for the rational test-taker striving to maximize her score. We are then in a position to evaluate which, if any, scoring function is superior with respect to different statistics, and also to draw some general conclusions regarding different scoring functions.
2 Latent-score model
2.1 Partial knowledge
Throughout this paper, we adopt an axiomatic approach where the latent score is to be defined transparently a priori rather than extracted statistically a posteriori. This means that we seek a definition of “partial knowledge” that can be evaluated independently of the multiple-choice test and which is intuitively obvious. The axiomatic approach further implies that we can only provide justification for our definition by way of examples.
Consider now the following hypothetical test item:
Which is the French word for cat?
A. chien B. chat C. choux D. chouette
A test-taker that knows the meaning of any of the distractors (chien, choux, chouette) can readily rule them out. This we take to be a reflection of the test-taker’s “partial knowledge”. If she honestly indicates, for instance, that “choux” and “chouette” are not the correct alternative, she has provided evidence that she knows the English meaning of these two words (for else how could she know they are not correct?). If asked, in a parallel form of the test with constructed responses, to give the English meaning of each of these four alternatives, she would succeed at two and fail at the others.
Consider now another example (based on one by [29]):
Which is the capital of Spain?
A. Milano B. Lisbon C. Madrid D. Barcelona
If the test-taker knows that Milano is in Italy, or that Barcelona is in Spain but does not hold the status of national capital, then she could rule these out, thereby making use of her “partial knowledge”. To capture this knowledge in a constructed-response parallel form, the test-taker could be asked for each of these cities in turn to provide their country of location and status (capital or not). From her answers, it should be clear whether her knowledge is sufficient to rule them out in this example item. In fact, simply asking the test-taker a series of open questions of the form “tell us about X”, where X is an answer option, should suffice.
At this point, we have to consider another type of multiple-choice test item, for which the concept of “partial knowledge” appears ill-defined, for instance, an item like:
Which of the below options is a synonym of “tenebrous”?
A. happy B. sad C. bright D. dark
Quite clearly, the test-taker might know the meaning of each of the four alternatives and still not know the meaning of “tenebrous” used in the item stem. But if the item is reformulated, for instance, by asking “Pick the two words below that are synonyms”, partial knowledge can be used to the test-taker’s advantage by the method of exclusion. A parallel form with constructed responses for this last variant of the test would ask for synonyms of each of the alternatives in turn.
From these examples, we propose the following definition.
Definition 1
For tests of vocabulary or, more generally, factual knowledge, provided that all of the terms in the stem of the multiple-choice test item are known and understood, the “latent score” on the item can be ascertained by the number of correct responses on a parallel constructed-response form where the test-taker is quizzed in isolation on the meaning or implications of each of the answer options, if such a parallel form can be constructed.
Note that we do not necessarily assume that the chance of correctly guessing on a constructed-response test item is zero. If the test-taker is sufficiently knowledgeable to narrow down the possible answers (without any cues) to such an extent that guessing correctly becomes likely, then a correct answer is simply counted as evidence of her knowledge. This does not affect our formalization of the test item solution to be presented below, as she would be able to use the same reasoning to rule out distractors, but illustrates operationally where we imagine the line between “known” and “unknown” is drawn in terms of certainty.
The distinction between “known” and “unknown” status of any answer option is meaningful for the test of factual knowledge, but not for tests of reasoning where the solution requires progression through several steps. For instance, in a test of mathematics, a test-taker may only be able to arrive at a partial solution of the problem; however, the rational test-taker that is aware that she has not solved the problem completely will exclude any answer option corresponding to this partial solution, if such an option is present, and then guess the correct answer among the remaining ones. This leads to an inversion of the credits model we consider and test items of this type will not be considered in this paper.
In the most general case that we consider, the test-taker is expected to answer, with full credit, all items correctly for which she knows either the meaning (and implications) of the key, or for which she knows the same for all of the distractors; and to answer, with partial credit, all other items for which she has only partial knowledge of the answer options. Keeping a general approach, by “partial credit” we consider also “no credit” as a special case, as long as the answer is not completely erroneous (i. e., the correct option not indicated at all). Nevertheless, it should be clear – and we will return to this point in a subsequent section – that our definition of the latent score (that is, as determined by the parallel constructed-response form of the multiple-choice test) cannot map perfectly to any partial credit scoring function for the multiple-choice test item. For instance, providing the correct answer to an item can be the result of luck, of knowing the key but no distractors, of knowing all of the distractors but not the key, or of knowing all distractors and key.
We shall consider an item to be correctly answered if only the key is indicated; if the key is indicated in addition to any number of distractors, the answer is only partially correct. If only distractors are indicated as correct, we take the answer to be incorrect. We take the blank answer to be partially correct (since it is equivalent to indicating all answer options). When necessary, we will also make a distinction between the “true score” and the “Platonic true score” [24]. The “true score” is in theory the average value of the observed score on infinitely repeated administrations of the same item and it may include effects of chance, whereas the “Platonic true score” is the actual score of the test-taker that is free of guessing.
2.2 Formalization
Since we are only interested in test statistics, the actual content is immaterial and the model can be readily formalized as follows. First, we consider that an answer alternative is “known” if the test-taker is “certain” (in the sense that she would answer correctly on a parallel constructed-response form) that it is either key or distractor, without needing to know the other alternatives. An alternative is “unknown” if the test-taker is “uncertain” about its status (we do not distinguish between different levels of “uncertainty” here). Second, we define a hypothetical “test” by assuming it contains a fixed number of items with c alternatives each, all of which are unique and only appear once in the whole test. Third, we consider a set of “test-takers”, each of whom knows a fixed but individual number of concepts drawn randomly from the keys and distractors of the hypothetical test. We may thus consider an arbitrary number of test-takers taking any number of arbitrarily designed tests using random distributions to mimic real-world test-taking populations of any desired statistical characteristics.
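This hypothetical construction is straightforward to simulate. The sketch below is an illustrative Monte Carlo (not taken from the paper): it treats every key and distractor as known independently with probability f (the large-pool limit of drawing a fixed fraction of unique concepts) and tallies how many answer options per item remain unknown to the test-taker.

```python
import random

def simulate_ruled_out(f, c, n_items=100_000, seed=0):
    """Monte Carlo of the latent-score model: every key and distractor is
    known independently with probability f; returns the empirical
    distribution of k, the number of options still in play (k = 1 means
    the item is answerable with certainty)."""
    rng = random.Random(seed)
    counts = [0] * (c + 1)
    for _ in range(n_items):
        key_known = rng.random() < f
        unknown_distractors = sum(rng.random() >= f for _ in range(c - 1))
        if key_known or unknown_distractors == 0:
            # key known directly, or found by eliminating all distractors
            k = 1
        else:
            # the candidate set is the key plus the unknown distractors
            k = unknown_distractors + 1
        counts[k] += 1
    return [n / n_items for n in counts]

dist = simulate_ruled_out(f=0.6, c=4)
```

The empirical frequency of k = c (no option ruled out) should approach \((1-f)^c\), the value quoted for \(P_c(f, c)\) below.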
Let us denote the probability that the test-taker that masters a fraction f of the material knows the answer to an item with c options with complete certainty as \(P_1(f, c)\). Likewise, we denote the probability that she knows with absolute certainty the key to be one of two options, but not which one, as \(P_2(f, c)\), and so on. We assume that there is no statistical difference between the probability of knowing a key or a distractor (which means that the test-taker is expected to know the same fraction f of all keys and distractors on the test, on average, if queried on a parallel form, which is our definition of the latent score). In other words, we assume that the completely ignorant test-taker (with \(f = 0\)) would assign equal probabilities to each of the alternatives of a test item being correct. The total probability of providing the correct answer, for a randomly chosen pair of rational test-taker and item, is [23, rewritten here in differential form]

$$P_{\textrm{corr}}(f, c) = \sum _{k=1}^{c} \frac{P_k(f, c)}{k},$$

where each term corresponds in turn to the probability of guessing the correct option among \(1, 2, \ldots , c\) alternatives, having excluded \(c-1, c-2, \ldots , 0\) of the answer options by partial knowledge.
In the Appendix, we show that

$$P_1(f, c) = f + (1-f)\, f^{\,c-1}$$

and

$$P_k(f, c) = \left( \begin{array}{c} c-1 \\ k-1 \end{array} \right) f^{\,c-k} (1-f)^{k}$$

for \(1< k < c\), and \(P_c(f,c) = (1-f)^c\).
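These probabilities are easy to tabulate numerically. In the sketch below, the closed forms are written out as derived from the stated model (each option known independently with probability f); the check that they sum to unity over k is a useful sanity test, and the k = c case reduces to \((1-f)^c\) as stated.

```python
from math import comb

def P(k, f, c):
    """Probability that the test-taker has narrowed the item down to
    k options; k = 1 means the answer is known with certainty."""
    if k == 1:
        # key known, or key unknown but all c - 1 distractors known
        return f + (1 - f) * f ** (c - 1)
    # key unknown, and exactly k - 1 of the c - 1 distractors unknown
    return comb(c - 1, k - 1) * f ** (c - k) * (1 - f) ** k

probs = [P(k, 0.6, 4) for k in range(1, 5)]
```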
3 Error-minimizing scoring function
For clarity of exposition, we shall for now set aside the probabilistic description of the test-taker’s knowledge, quantified by the parameter f, and briefly turn to a deterministic description, quantified by the number of unknown answer options, i. We will return to the probabilistic description in Sect. 3.2.
Let \(p_n\) denote the score awarded for indicating n answer options, of which one is correct, in a multiple-choice test item. In other words, \(p_n, n > 1\) designates the point value of a partially correct answer and \(p_1\) denotes the point value of a fully correct one. If there are i unknown alternatives (one key and \(i-1\) distractors), the expected value of the score when randomly guessing n answer options among them is then
$$S(i, n) = p_n\, \frac{\left( \begin{array}{c} i-1 \\ n-1 \end{array} \right) }{\left( \begin{array}{c} i \\ n \end{array} \right) } = \frac{n}{i}\, p_n,$$

since there are \(\left( \begin{array}{c} i \\ n \end{array} \right) \) different ways of indicating an answer comprising n among the i options but only \(\left( \begin{array}{c} i - 1 \\ n - 1 \end{array} \right) \) different ways of indicating answers which all contain the key. The ratio of these two quantities is the probability that the key is among the n options indicated in the answer; such an answer merits a score of \(p_n\), whereas the incorrect answer is, for simplicity, assumed worth zero points.
Likewise, the score variance when guessing n answer options is given by

$$V[S(i, n)] = p_n^2\, \frac{n}{i} \left( 1 - \frac{n}{i} \right).$$
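Both moments can be verified by brute-force enumeration of all possible guesses. The following sketch assumes, as in the text, that an answer missing the key is worth zero points; the closed forms compared against (\(p_n\, n/i\) for the mean and \(p_n^2 (n/i)(1-n/i)\) for the variance) follow from the hypergeometric ratio described above.

```python
from itertools import combinations

def guess_moments(i, n, p_n):
    """Enumerate every way of marking n of the i unknown options (the key
    is option 0) and return the mean and variance of the resulting score;
    an answer that misses the key is worth zero."""
    scores = [p_n if 0 in marked else 0.0
              for marked in combinations(range(i), n)]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, var

mean, var = guess_moments(i=4, n=2, p_n=0.5)
```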
The idea is to choose the set of coefficients \(\{p_j\}_{j=1}^c\) to minimize the variance whenever the test-taker tries to maximize the expected value. We assume that the rational test-taker will exclude answer options that she knows to be wrong (effectively decreasing the value of i), and then provide an answer comprising n of the remaining options.
3.1 Necessary and sufficient conditions to suppress guessing
The strategic choices that the test-taker makes will depend on her knowledge, which is unknown to the test-maker at the time of test construction. Assume that faced with a question item, the test-taker uses her knowledge to narrow the feasible options down to i alternatives. We want to ensure that she provides a partial answer comprising not more and not fewer than these i alternatives, since otherwise an element of chance unrelated to her knowledge is introduced in the score (increasing the test-retest variance).
The deterministic function S(i, n) gives the expected score when indicating by guesswork n options among i unknown as correct and we want to ensure that the rational strategy is to mark all of these i options and not only gamble on a subset of, say, \(m < i\) options. We also want to ensure that the rational strategy precludes inclusion of options that the test-taker knows to be wrong, which would lead to a scoring function with some particularly undesirable properties. Therefore, we require that \(S(i, i) > S(i, m)\) for \(m < i\) (penalizing the gamble of answering more narrowly than warranted by partial knowledge) as well as \(S(i,i) > S(i+1,i+1)\) for \(i<c\) (penalizing the “hedging-your-bets” strategy of answering more inclusively than warranted).
The first inequality leads to the condition:

$$i\, p_i > m\, p_m, \quad m < i.$$

A natural way to satisfy it is to write the recursive relation:

$$i\, p_i = (i-1)\, p_{i-1} + \epsilon _i, \quad i > 1,$$

where \(\{\epsilon _i\}\) are positive constants. This recursive relation can be rewritten in closed form as,

$$p_i = \frac{1}{i} \sum _{j=1}^{i} \epsilon _j,$$

with the identification \(\epsilon _1 = p_1\).
The second inequality then leads to the requirement,

$$\epsilon _{k+1} < \frac{1}{k} \sum _{j=1}^{k} \epsilon _j,$$

where \(c > k \ge 1\). This imposes an upper limit on the values of \(\{\epsilon _j\}\); however, in practice, we will consider only small values and will not need to pay any explicit heed to this constraint.
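Both families of inequalities are easy to audit numerically for any candidate score vector. The sketch below (an illustration, not the paper's code) uses the expected guess score \(S(i, n) = (n/i)\, p_n\) implied by the combinatorial argument of the previous section, and checks the choice \(p_n = 1/n + \delta\) with \(\delta > 0\); the pure choice \(\delta = 0\) satisfies the first family only weakly (with equality).

```python
def expected_guess(i, n, p):
    # the key is among the n marked options with probability n/i; a
    # partial answer of n options containing the key is worth p[n],
    # a miss is worth zero
    return (n / i) * p[n]

def suppresses_guessing(p, c):
    """Check S(i,i) > S(i,m) for m < i (no narrow gambles) and
    S(i,i) > S(i+1,i+1) (no hedging wider than warranted)."""
    narrow = all(expected_guess(i, i, p) > expected_guess(i, m, p)
                 for i in range(2, c + 1) for m in range(1, i))
    wide = all(expected_guess(i, i, p) > expected_guess(i + 1, i + 1, p)
               for i in range(1, c))
    return narrow and wide

c, delta = 4, 0.05
p = {n: 1 / n + delta for n in range(1, c + 1)}  # p_1, ..., p_c
```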
The ideal test-retest variance due to guessing is eliminated for the rational or risk-averse test-taker with all scoring functions for which the expected value of a guess beyond the test-taker’s knowledge is negative, and inequalities (6) and (9) provide this condition. However, in the actual testing situation, the only variance observed is that among the test-takers, and this variance is not only due to random guessing, since it includes also differences in measured ability in the test-taker population. The latter variance should be conserved in the test situation. We shall therefore narrow down our choice for \(p_i\) further.
3.2 Relation between item and latent score
Assume a test-taker knows a fraction f of the material (distractors and keys; f is directly proportional to the extent of the test-taker’s factual knowledge in the domain tested). The expected value of the score that she will get for an item, relative to a blank answer for which she is awarded \(p_c\) points, is then \(\sum _{i=1}^{c-1} (p_i - p_c) P_i(f, c)\), but in order for the points awarded to reflect the expected knowledge, this sum should equal f (or, at any rate, f times a constant). We thus have an equation,

$$\sum _{i=1}^{c-1} (p_i - p_c)\, P_i(f, c) = f.$$
Since \(P_i(f, c)\) is a polynomial of degree c, comparing coefficients in f turns this equation into a linear system of c equations in \(c-1\) unknowns. Since the system is overdetermined, we conclude that a perfectly linear correlation between the expected test score and our model for the latent score is unattainable for a multiple-choice test. This is the conclusion we drew in a preceding section, repeated here on more formal grounds.
To make headway, we pursue an alternative (albeit approximate) approach in which we consider that the test-taker knows a fraction g/c of the answer options for an item, where \(g = 1,\ldots , c - 1\), and then compute the expected score relative to the blank answer under the following assumption: if the key is known (which it is with probability g/c), the test-taker answers correctly for \(p_1\) points; otherwise, she answers partially correctly for \(p_{c-g}\) points. In this case, requiring that the expected score with respect to the blank answer equals the knowledge possessed, one obtains a linear system of \(c - 1\) equations of the form,

$$\frac{g}{c}\, p_1 + \left( 1 - \frac{g}{c} \right) p_{c-g} - p_c = \frac{g}{c}, \quad g = 1, \ldots , c-1.$$
If one then sets \(p_c = 1/c\), the solution becomes the scoring function of Zapechelnyuk [44] – derived independently, apparently, by Otoyo and Bush [32] on heuristic grounds – where in our notation \(\epsilon _i = 0\) for \(i > 1\) and \(\epsilon _1 = 1\). The fact that \(\epsilon _i = 0\) for \(i > 1\) implies a weak violation of inequality (6), meaning both that the risk-neutral test-taker may guess (increased random variance) and that the latent score of the risk-averse test-taker may be underestimated.
Since keeping the relation of the score with the fraction of knowledge possessed as linear as possible is desirable, one might inquire about other choices of \(p_c\). One can easily verify that values of \(p_c < 1 / c\) will violate the condition that \(\epsilon _i > 0\) even further, but values of \(p_c > 1/c\) lead to compliance (\(\epsilon _i > 0\)) as long as the upper limit for \(p_c\) implied by inequality (9) is respected. Moreover, the point difference with respect to the blank answer remains the same, but the relative point gaps between consecutive types of answers change. Thus, in reality, the precise value of \(p_c > 1/c\) will be dictated by the psychology of the test-takers (e. g., degree of risk aversion or risk seeking).
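Our reading of this system can be made concrete in a few lines: requiring that the expected score relative to a blank answer equal g/c for each g gives \((g/c)p_1 + (1-g/c)p_{c-g} - p_c = g/c\), which the sketch below (an illustration, not the paper's code) solves for a given \(p_c\). With \(p_c = 1/c\) one recovers \(p_n = 1/n\); with \(p_c = 1/c + \delta\) every score is shifted by the same \(\delta\), which is consistent with the observation that the choice of \(p_c > 1/c\) does not affect the rational test-taker's behavior.

```python
def solve_scores(c, p_c):
    """Solve (g/c) p_1 + (1 - g/c) p_{c-g} - p_c = g/c for g = 1..c-1;
    the g = c - 1 equation involves p_1 alone and seeds the rest."""
    p = {c: p_c, 1: (c - 1) / c + p_c}   # p_1 from the g = c - 1 equation
    for g in range(1, c - 1):            # remaining unknowns p_{c-g}
        p[c - g] = (g / c + p_c - (g / c) * p[1]) / (1 - g / c)
    return [p[n] for n in range(1, c + 1)]

zap = solve_scores(4, 1 / 4)         # recovers p_n = 1/n
mz = solve_scores(4, 1 / 4 + 0.05)   # every score shifted by 0.05
```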
4 Comparison of scoring functions for test items
Having established our formalization of the rational test-taker and her latent score, we shall investigate different scoring functions for the test items for purposes of illustration and of comparison. These different functions correspond to different scores awarded for different types of answers to one and the same item, the type of answer in turn being dictated by the rational test-taker that we consider in our formalization. For brevity, we will not consider every example which can be found in the literature, even if the mathematical approach is general enough. In the following, we let m denote the number of distractors known to the test-taker and c the total number of answer options on the multiple-choice test item. Nota bene, we reuse the symbol S from the previous section to denote the scoring function, but will consider it as a stochastic function of c and f, rather than as a deterministic function of c and n.
In the first scoring function considered, corresponding to the typical “number correct” scheme, the test-taker is awarded \(p_1\) points if either the key or all of the distractors are known. If not, \(p_1\) points are awarded with a probability of \(1/(c-m)\). This situation corresponds to the rational test-taker guessing whenever in doubt, and answering correctly whenever certain. In our formalization, the first two moments of this score function are

$$E[S_{\textrm{NC}}(f, c)] = p_1 \sum _{k=1}^{c} \frac{P_k(f, c)}{k}$$

and

$$E[S_{\textrm{NC}}(f, c)^2] = p_1^2 \sum _{k=1}^{c} \frac{P_k(f, c)}{k},$$

where “NC” stands for “number correct”.
In the second scoring function we consider, corresponding to a modified Zapechelnyuk (MZ) scoring function (with \(p_c > 1/c\)), points are awarded according to the following procedure:

- If either the key or all of the distractors are known, the test-taker is awarded \(p_1\) points.

- If the key is not known, the test-taker is awarded \(p_{c-m}\) points.
The set \(\{p_i\}\) is defined by solving the system of equations implicit in Eq. (11) with \(p_c > 1/c\). The precise value of \(p_c\) has no effect on the behavior of our formalized, rational test-takers as long as it is greater than 1/c. In formulae, we have

$$E[S_{\textrm{MZ}}(f, c)] = p_1 P_1(f, c) + \sum _{k=2}^{c} p_k P_k(f, c)$$

and

$$E[S_{\textrm{MZ}}(f, c)^2] = p_1^2 P_1(f, c) + \sum _{k=2}^{c} p_k^2 P_k(f, c)$$

for the first and second moments, respectively.
In the third one, proposed by Frandsen and Schwartzbach [20], \(p_1 \ln (c)\) points are given if either the key is known or all distractors are, and \(p_1 \ln (c/(c-m))\) points otherwise. In the original formulation, there is a variable point penalty for incorrect answers designed to nullify the expected score of guessing. We do not need to consider it explicitly here because the test-takers we model are not risk-seeking. Mathematically, we have

$$E[S_{\textrm{FS}}(f, c)] = p_1 \left[ \ln (c)\, P_1(f, c) + \sum _{k=2}^{c} \ln (c/k)\, P_k(f, c) \right]$$

and

$$E[S_{\textrm{FS}}(f, c)^2] = p_1^2 \left[ \ln ^2(c)\, P_1(f, c) + \sum _{k=2}^{c} \ln ^2(c/k)\, P_k(f, c) \right].$$
The subscript “FS” is a mnemonic for “Frandsen-Schwartzbach”.
In the fourth one, corresponding to the popular “subset selection” (SS) scoring first proposed by Dressel and Schmid [18], \(p_1\) points are awarded if the test-taker knows the key or all of the distractors, and otherwise \(p_1(1-(c-m-1)/(c-1))\) points are awarded. It must be pointed out that this scoring function, as formulated in the original reference, strictly violates Eq. (6) on several points. It is therefore implicitly assumed that a penalty for incorrect answers is also included in the scoring function so as to negate the expected value of all guesses. Our formalization gives the moments,

$$E[S_{\textrm{SS}}(f, c)] = p_1 \left[ P_1(f, c) + \sum _{k=2}^{c} \frac{c-k}{c-1}\, P_k(f, c) \right]$$

and

$$E[S_{\textrm{SS}}(f, c)^2] = p_1^2 \left[ P_1(f, c) + \sum _{k=2}^{c} \left( \frac{c-k}{c-1} \right) ^2 P_k(f, c) \right],$$
under these slightly modified rules.
Finally, in the last one, \(p_1\) points are awarded for the correct answer, and a “very large” number is subtracted as a penalty for providing the wrong answer, meaning in our case that the test-taker will answer if she knows the answer with complete certainty and leave it blank otherwise for no points. For our purposes, we do not need to specify exactly how large this penalty is, but it is chosen at least as large as to make the expected score of a random guess on only two options negative. This gives very simple expressions for the first and second moments,

$$E[S_{\textrm{NG}}(f, c)] = p_1 P_1(f, c)$$

and

$$E[S_{\textrm{NG}}(f, c)^2] = p_1^2 P_1(f, c),$$

respectively. The subscript “NG” stands for “no guessing”.
A test is composed of several independent test items, which we consider to be drawn randomly from a set of keys and distractors. We denote the score on an item by S, and the total score is then simply the sum of the score on each item (and the total variance is the sum of the individual variances). In what follows, we simply deal with a test composed of a single item for simplicity but without any loss of generality. Results reported for a fixed value of f can be interpreted either as an average over all items on an infinite test for a single test-taker of specific ability, or as the average over an infinite number of test-takers of fixed ability on a fixed item.
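For a single item, the first moment of every scheme is a \(P_k\)-weighted sum of the per-state payoffs described above. The sketch below (an illustration with the assumed choices \(p_1 = 1\), the MZ scores taken as \(p_k = 1/k\), and the closed-form \(P_k\) derived from the model) makes the comparison concrete.

```python
from math import comb, log

def P(k, f, c):
    # probability of having narrowed the item down to k options
    if k == 1:
        return f + (1 - f) * f ** (c - 1)
    return comb(c - 1, k - 1) * f ** (c - k) * (1 - f) ** k

PAYOFF = {
    "NC": lambda k, c: 1 / k,                   # expected score of a guess
    "MZ": lambda k, c: 1 / k,                   # partial credit p_k = 1/k
    "FS": lambda k, c: log(c / k),              # logarithmic partial credit
    "SS": lambda k, c: (c - k) / (c - 1),       # subset selection
    "NG": lambda k, c: 1.0 if k == 1 else 0.0,  # no guessing: leave blank
}

def expected_score(scheme, f, c=4):
    """First moment E[S(f, c)] of one item with p_1 = 1."""
    return sum(PAYOFF[scheme](k, c) * P(k, f, c) for k in range(1, c + 1))
```

With these choices NC and MZ share the same first moment; they differ in the second moment, the deterministic partial credit of MZ carrying less dispersion than the all-or-nothing guess of NC.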
4.1 Validity
It is usually desirable to have a scoring function that gives as linear as possible a relation with the latent score in order to enhance and facilitate comparisons between test-takers. It also means that the observed score (with respect to the underlying ability) is given on an interval, as opposed to ordinal, scale.
Therefore, we take deviations from the perfectly linear relation between the scoring function and the latent score to measure the extent of the “invalidity” of the observed score; contrariwise, the scoring function exhibits high validity if this relation is perfectly linear. In other words, we interpret a scoring function to be “valid” if on average it predicts the f-score linearly, no matter how large the dispersion around this prediction (which we take to be captured by the reliability and the measurement precision).
We will consider two measures of this linearity, since we know already that it will be compromised for values of f close to unity (although to different extents for different scoring functions). Our first index is the linear correlation coefficient, which is population-independent; we also introduce a second one, a coefficient of validity corresponding in mathematical form closely with the reliability coefficient, both being computed from the observed variances in the test-taker population.
4.1.1 Linear correlation with latent score
Whereas the rank correlation between E(S) and f is unity for all of the considered scoring functions (ensuring they are valid for rank sorting and constitute at least a true ordinal scale), the linear Pearson correlation coefficient differs slightly between them. Keeping \(c = 4\) as our test case, the calculated results are reported in Table 1. In all cases, the greatest deviations from linearity are observed for f close to unity (not shown), which is expected because higher-order polynomial terms in Eq. (10) become important only at large f, being naturally suppressed for small f.
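The Pearson coefficient in question can be reproduced on a grid of f values; the sketch below does so for NC scoring with c = 4 (the closed-form \(P_k\) from the model is inlined; the exact entries of Table 1 are not reproduced here).

```python
from math import comb

def e_nc(f, c=4):
    """Expected number-correct score: guess among the k unknown options."""
    p1 = f + (1 - f) * f ** (c - 1)
    tail = sum(comb(c - 1, k - 1) * f ** (c - k) * (1 - f) ** k / k
               for k in range(2, c + 1))
    return p1 + tail

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

grid = [i / 200 for i in range(201)]
r = pearson(grid, [e_nc(f) for f in grid])
```

The curve is monotone but concave, flattening as f approaches unity, which is where the linearity is lost.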
4.1.2 Coefficient of validity
Here we introduce a coefficient of validity along the same lines as the coefficient of reliability in the next section, that is, one which is computed from population variances.
The variance in S on an item among test-takers of fixed ability f is given by

$$V[S(f, c)] = E[S(f, c)^2] - E[S(f, c)]^2,$$

where \(E(\cdot )\) denotes an expectation value. V[S(f, c)] is the variance for all test-takers of ability f, which is independent of the actual distribution of f. To obtain the population-averaged variance for all abilities, we integrate over the latent-score distribution,

$$\int _0^1 V[S(f, c)]\, \phi (f)\, df,$$
where \(\phi (f)\) is the probability density function for f in the test-taker population. In the language of classical test theory, this variance is “error variance” (and not “true score” variance; vide infra) since the integrated variance stems from an individual score variance, V[S(f, c)], that is non-zero even in a hypothetical population with no variance in the ability, f. In addition, we define the expected error as

$$\Delta (f) = E[S(f, c)] - f\, E[S(1, c)],$$

which can be seen to vanish for all f only if S(f, c) is linear in f, that is if \(E[S(f,c)] = fE[S(1,c)]\), and compute the variance of this expected error as

$$\sigma _{\textrm{E}}^2 = \int _0^1 \Delta (f)^2\, \phi (f)\, df - \left( \int _0^1 \Delta (f)\, \phi (f)\, df \right) ^2$$

across f for the different scoring functions as per above. The variance \(\sigma _{\textrm{E}}^2\) represents the “Platonic true score” error variance, in that it is the variance of the deviation of the expected score from the value linearly predicted by the underlying ability. We now define a validity coefficient as the proportion of the total error that is not “Platonic true score” error variance, i.e.,

$$v = 1 - \frac{\sigma _{\textrm{E}}^2}{\sigma _{\textrm{E}}^2 + \int _0^1 V[S(f, c)]\, \phi (f)\, df}.$$
This coefficient is bounded between zero and unity and attains its maximum when the prediction by the expectation never deviates from linearity. Conversely, it attains its minimum if there is no statistical uncertainty around erroneous predictions. This behavior agrees with the verbal definition given in Sect. 4.1.
Both variances are functions of the chosen f-distribution and to give arbitrary but clear indications of the effects of the different scoring functions, we consider a test with \(c=4\) and two different choices for \(\phi (f)\): one “broad” distribution (Distribution I), which we take to be the uniform distribution for \(f \in [0, 1]\), and one “narrow” distribution (Distribution II), which we take to be the normal distribution with mean \(E(f) = 0.6\) and standard deviation \(\sigma _f = 0.1\). Refer to Table 2 for the results. In general, the computed validity is smaller for Distribution I than for Distribution II and this decline in the accuracy (which is, however, not that substantial) is mainly a consequence of sampling the ability distribution for f close to unity.
4.2 Reliability
The total variance for the observed score is the sum of the variance in Eq. (23), representing the contribution to the variance from chance effects, and the two terms,

$$\int _0^1 E[S(f, c)]^2\, \phi (f)\, df - \left( \int _0^1 E[S(f, c)]\, \phi (f)\, df \right) ^2,$$

representing the “true score” variance, the “true score”, E[S(f, c)], being simply the expectation value of the observed score [24]. Hence, the reliability coefficient, given as the ratio of true score variance to total variance, is

$$r = \frac{\displaystyle \int _0^1 E[S(f, c)]^2\, \phi (f)\, df - \left( \int _0^1 E[S(f, c)]\, \phi (f)\, df \right) ^2}{\displaystyle \int _0^1 E[S(f, c)^2]\, \phi (f)\, df - \left( \int _0^1 E[S(f, c)]\, \phi (f)\, df \right) ^2}.$$
Note that this reliability coefficient is computed as an average over parallel forms of the test with non-identical items; it is not a test–retest coefficient. Thus a random element is present whether or not the test-taker remembers the keys and distractors from the first form; though non-identical, the items on the parallel form are still equivalent from the perspective of the model.
Under the same conditions as for the computed validity coefficients, the calculated results are reported in Table 3. All of the partial credits scoring functions exhibit increased reliability with respect to both NC and NG scoring. An increased reliability for NG vis-à-vis NC scoring is also apparent, but it does not quite reach the level of the partial credits models. The modified SS scoring function (with penalties added for guessing) exhibits the highest predicted reliability. It is to be stressed that without this modification, its reliability would be lower, approaching that of NC.
The decrease in computed reliability when moving from the broad to the narrow ability distribution is a consequence of the diminished true score variance, leaving more of the variance to chance effects. For a perfectly homogeneous distribution where all the test-takers share the same ability, there is no true score variance at all and the reliability coefficient is zero. The highest reliabilities will be obtained for distributions that are weighted toward the upper level of f-ability, because then the chance effects are reduced.
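The ratio just described can be sketched numerically. The example below assumes, purely for illustration, a dichotomous NC item answered correctly with probability \(p(f) = f + (1-f)/c\) (a hypothetical stand-in for the model's scoring functions); the true-score variance is then the variance of p(f) across test-takers, and the chance variance is the mean Bernoulli variance \(p(1-p)\).

```python
from statistics import pvariance, fmean

# Minimal sketch, assuming (for illustration only) a dichotomous item with
# Bernoulli success probability p(f) = f + (1 - f)/c.
def reliability(f_values, c=4):
    p = [f + (1 - f) / c for f in f_values]
    true_score_var = pvariance(p)                     # between-person variance
    chance_var = fmean(pi * (1 - pi) for pi in p)     # within-person variance
    total_var = true_score_var + chance_var
    return true_score_var / total_var if total_var > 0 else 0.0

uniform = [i / 100_000 for i in range(100_001)]       # broad Distribution I
print(round(reliability(uniform), 3))                 # analytic value is 1/5

# A perfectly homogeneous population has no true-score variance at all,
# so the single-item reliability collapses to zero.
print(reliability([0.6] * 100))
```

The second call illustrates the remark above that a perfectly homogeneous f-distribution yields zero reliability.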
4.3 Discriminatory power
Polytomously scored items allow a finer discrimination among the observed scores than dichotomous ones. In this section, we seek to quantify this added value. In the literature, the “discrimination” of a test item is usually taken to be its Pearson correlation with the total test score. This is not sufficient for our analysis: we want to quantify the probability that a single test-taker receives, on a single item, a fair score reflecting her latent ability as precisely as possible. To be more precise: for two abilities \(f_1\), \(f_2\) with \(f_2 = f_1 + \epsilon \), \(\epsilon \ge 0\), a general scoring function S(f, c) will satisfy \(M[S(f_2, c)] = M[S(f_1, c)]\) for \(\epsilon \) sufficiently close to zero, where \(M(\cdot )\) designates the mode. Such cases entail a loss of discrimination among test-takers, and higher values of d (defined below) indicate a smaller overall influence (in an average sense) of these errors on the score.
Let \(\pi (s; f, c)\) be the probability that a test-taker of ability f achieves at least a score s on an item with c options. Clearly, we have
and
independently of s for \(s > 0\) for any scoring function that does not award partial marks (the probability is unity for \(s = 0\)). With the MZ scoring function and no guessing, we have
and the cases for FS and SS follow analogously.
Denote the point value of a blank answer by \(p_\textrm{blank}\). Then we use the integral
as a measure of the discriminatory power of the scoring function. Here \(\delta p\) is the point difference between a fully correct and a blank answer (usually this is \(p_1 - p_\textrm{blank}\) in the terminology of this paper, but for the FS rule, it is \(p_1 \ln c\)) and \(\Delta \) is a sensitivity parameter which we arbitrarily set to \(\Delta = \delta p / 10\) for the reported results; d diverges as \(\Delta \rightarrow 0\), rendering any comparison impossible in this limit, but smaller values of \(\Delta \) lead to larger differences in d between the scoring functions. Results are given in Table 4.
As seen, the results are very sensitive to the underlying distribution, with a complete reversal of the ranking of the partial credits models between Distributions I and II. The small value for MZ for the narrow distribution centered on \(f = 0.6\) reflects its poor discriminatory power around \(f \approx 0.6\). In fact, this scoring function is most discriminatory for \(f < 0.5\) and becomes essentially dichotomous above that.
4.4 Relative precision
The precision of the score reflects the information gained, in the sense that one is certain that the obtained score is correct [31], and this statistic is distinct from the reliability which measures the extent to which the obtained scores for the same test-taker in independent measurements are correlated. We compute the relative precision by normalizing the standard deviation at fixed ability by the expectation of the score, thus
where V[S(f, c)] is given in Eq. (22) and \(p_\textrm{blank}(c)\) represents the score of a blank answer. This procedure yields a number that is more aptly termed “relative uncertainty” than “relative precision”, since larger values correspond to less precision. Averaged results over Distributions I and II are given in Table 5.
In contrast to the reliability, the precision is increased for the narrower distribution with respect to the broad uniform distribution. This is mostly a consequence of the reduced chance variation for test-takers of higher ability. Indeed, for the test-taker of ability \(f=1\), the computed precision is perfect. As inferred from the results, in the lower range of ability the NG scoring function yields the most precise relative measurements, but this advantage is lost with more knowledgeable test-takers. Across both distributions, the modified SS rule is arguably the most precise. Note also that the large relative uncertainties (of the order of 30–50%) reported here are for a single test item. Unlike the reliability above, as the number of items is increased, the relative uncertainty decays asymptotically as an inverse square root toward zero within the model.
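The inverse-square-root decay can be checked under an assumed simplification: a test of n independent, identically distributed dichotomous items, each answered correctly with probability p (the helper below is a hypothetical illustration, not a function from the paper).

```python
from math import sqrt

# For a sum of n iid Bernoulli(p) item scores, the relative uncertainty
# sd/mean is sqrt(n * p * (1 - p)) / (n * p) = sqrt((1 - p)/p) / sqrt(n).
def relative_uncertainty(p, n_items):
    return sqrt(n_items * p * (1 - p)) / (n_items * p)

p = 0.7
one_item = relative_uncertainty(p, 1)
# Quadrupling the number of items halves the relative uncertainty.
print(round(one_item / relative_uncertainty(p, 4), 3))
```

The ratio is exactly 2 for any \(0 < p < 1\), reflecting the \(1/\sqrt{N}\) decay.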
5 Conclusion
Having considered a model where the test-taker is presumed to possess a definite set of “facts” that are reflected in the distractors and keys of the multiple-choice test, we have compared different scoring functions for the kind of multiple-choice test items for which partial knowledge is expected to contribute to the test-taking strategy. We should also point out that the formal model that we have presented, and which is the basis for our analysis, relies on an axiomatic representation of factual knowledge and the probability of solving a multiple-choice test item. The approach is hence fundamentally different from polytomous Rasch [29, 2] and item-response models [43], which rely on statistical fitting, but similar to knowledge space theory [17, 16] in its epistemological assumptions, of which it can be considered a limiting special case. We stress that the partial credits scoring functions investigated are intended to suppress guessing a priori; they hence differ fundamentally from any approach [38,39,40,35] which attempts to account for guessing a posteriori. These a posteriori methods are inherently unreliable whenever sample sizes are small; they are hence unsuitable for general classroom assessment and limited to large-scale assessments.
Like Frary [21] and Chica and Tárrago [12] before us, we find that compared to the NC scoring function which does not suppress guessing, the computed reliabilities are vastly improved for all scoring functions that penalize incorrect answers; moreover, the partial credits scoring functions all exhibit further improvements in reliability over the dichotomously scored item. The argument for using at least some form of partial credits scoring functions is thus strong. Moreover, taking irrationality and risk-seeking behavior into account, Espinosa and Gardeazabal [19] find that the penalty for incorrect answers should exceed the typical one which is usually taken to precisely negate the effect of blind guessing for dichotomous items. It is likely that the same applies also to partial credits scoring, but all of the scoring functions considered in this work are readily adaptable in this regard: for instance, for the MZ scoring function, this means that \(p_c\) should be chosen sufficiently large.
While we have refrained from taking into account irrationality, risk-taking and unawareness of the extent of one’s own knowledge as factors in test-taker behavior, we can consider the results as a mathematical limiting case. Real-world testing situations will presumably approach this limit if the test-takers are clearly instructed that “guessing” in the sense of identifying more than one possible answer leads to an expected increase of their score, relative to guessing on a single answer. It is not altogether unlikely that this removes the “ethical dilemma” that apparently keeps some test-takers from guessing even when explicitly encouraged to do so [38]. Moreover, students have been found highly compliant with instructions not to guess even in the absence of any penalty scoring [14]. If both the score-optimal strategy and the instructions align, it is difficult to imagine anything but stronger compliance, especially if students are encouraged to be honest about their partial knowledge.
With regard to more facets than just the reliability of the test, we find that there is no single clearly superior scoring function among the ones tested, each having its own merits: the NG scoring function produces scores with the highest linear correlation with the underlying construct; the modified SS rule yields the highest reliability and precision (its unmodified form does not suppress guessing); the FS rule yields the most consistent item-level discrimination across the ability range, and so on. This explains the proliferation of different scoring functions in the literature. The virtues of the Zapechelnyuk scoring function [44], as regards the reliability, have been noted by Otoyo and Bush [32] in an empirical study comparing it to traditional NC scoring. Here, we replicate this finding theoretically, but also report that other partial credits scoring functions should yield even higher reliabilities, a finding which should be confirmed empirically. We also note that the Zapechelnyuk scoring function seems more appropriate for scoring low-ability populations, on account of its low discriminatory power at the higher end of the f-distribution.
References
Akeroyd, M.: Progress in Multiple Choice Scoring Methods, 1977/81. J. Furth. High. Educ. 6(3), 86–90 (1982)
Andrich, D.: A rating formulation for ordered response categories. Psychometrika 43(4), 561–573 (1978)
Arnold, J., Arnold, P.: On scoring multiple choice exams allowing for partial knowledge. J. Exp. Edu. 39(1), 8–13 (1970)
Ben-Simon, A., Budescu, D.V., Nevo, B.: A comparative study of measures of partial knowledge in multiple-choice tests. Appl. Psychol. Meas. 21(1), 65–88 (1997)
Betts, L.R., Elder, T.J., Hartley, J., Trueman, M.: Does correction for guessing reduce students’ performance on multiple-choice examinations? Yes? No? Sometimes? Assess. Eval. High. Educ. 34(1), 1–15 (2009)
Birnbaum, A.: Some latent trait models and their use in inferring an examinee’s ability. In: Lord, F.M., Novick, M.R. (eds.) Statistical Theories of Mental Test Scores, Chap. 17. Information Age Publishing, Charlotte, North Carolina (2008)
Bliss, L.B.: A test of Lord’s assumption regarding examinee guessing behavior on multiple-choice tests using elementary school students. J. Edu. Meas. 17, 147–153 (1980)
Bradbard, D.A., Parker, D.F., Stone, G.L.: An alternate multiple-choice scoring procedure in a macroeconomics course. Decis. Sci. J. Innov. Educ. 2(1), 11–26 (2004)
Budescu, D., Bar-Hillel, M.: To guess or not to guess: A decision-theoretic view of formula scoring. J. Edu. Meas. 30(4), 277–291 (1993)
Burton, R.F., Miller, D.J.: Statistical modelling of multiple-choice and true/false tests: ways of considering, and of reducing, the uncertainties attributable to guessing. Assess. Eval. High. Edu. 24(4), 399–411 (1999)
Bush, M.: A multiple choice test that rewards partial knowledge. J. Furth. High. Educ. 25(2), 157–163 (2001)
Chica, J.C., Tárrago, M.J.G.: Estudio de la fiabilidad de test multirrespuesta con el método de Monte Carlo. Revista de Educación 392, 63–95 (2021)
Coombs, C.H., Milholland, J.E., Womer, F.B.: The assessment of partial knowledge. Edu. Psychol. Meas. 16(1), 13–37 (1956)
Delgado, A.R.: Using the Rasch model to quantify the causal effect of test instructions. Behav. Res. Method. 39(3), 570–573 (2007)
Diamond, J., Evans, W.: The correction for guessing. Rev. Edu. Res. 43(2), 181–191 (1973)
Doignon, J.-P., Falmagne, J.-C.: Spaces for the assessment of knowledge. Int. J. Man-Mach. Stud. 23(2), 175–196 (1985)
Doignon, J.-P., Falmagne, J.-C.: Knowledge Spaces. Springer, Germany (2012)
Dressel, P.L., Schmid, J.: Some modifications of the multiple-choice item. Educ. Psychol. Meas. 13(4), 574–595 (1953)
Espinosa, M.P., Gardeazabal, J.: Optimal correction for guessing in multiple-choice tests. J. Math. Psychol. 54(5), 415–425 (2010)
Frandsen, G.S., Schwartzbach, M.I.: A singular choice for multiple choice. ACM SIGCSE Bulletin 38(4), 34–38 (2006)
Frary, R.B.: A simulation study of reliability and validity of multiple-choice test scores under six response-scoring modes. J. Edu. Stat. 7(4), 333–351 (1982)
Gibbons, J.D., Olkin, I., Sobel, M.: A subset selection technique for scoring items on a multiple choice test. Psychometrika 44(3), 259–270 (1979)
Horst, P.: The difficulty of a multiple choice test item. J. Edu. Psychol. 24(3), 229 (1933)
Klein, D.F., Cleary, T.A.: Platonic true scores: Further comment. Psychol. Bull. 71(4), 278 (1969)
Lee, S., Bolt, D.M.: An alternative to the 3PL: using asymmetric item characteristic curves to address guessing effects. J. Edu. Meas. 55(1), 90–111 (2018)
Lesage, E., Valcke, M., Sabbe, E.: Scoring methods for multiple choice assessment in higher education: is it still a matter of number right scoring or negative marking? Stud. Educ. Evaluat. 39(3), 188–193 (2013)
Lord, F.M.: Formula scoring and number-right scoring. J. Edu. Meas. 12, 7–11 (1975)
Martín, E.S., Del Pino, G., De Boeck, P.: IRT models for ability-based guessing. Appl. Psychol. Meas. 30(3), 183–203 (2006)
Masters, G.N.: A Rasch model for partial credit scoring. Psychometrika 47(2), 149–174 (1982)
Medawela, R.S.H.B., Ratnayake, D.R.D.L., Abeyasinghe, W.A.M.U.L., Jayasinghe, R.D., Marambe, K.N.: Effectiveness of “fill in the blanks” over multiple choice questions in assessing final year dental undergraduates. Educación Médica 19(2), 72–76 (2018)
Mellenbergh, G.J.: Measurement precision in test score and item response models. Psychol. Method. 1(3), 293 (1996)
Otoyo, L., Bush, M.: Addressing the shortcomings of traditional multiple-choice tests: subset selection without mark deductions. Pract. Assess. Res. Eval. 23(1), 18 (2018)
Papenberg, M., Diedenhofen, B., Musch, J.: An experimental validation of sequential multiple-choice tests. J. Exp. Edu. 89(2), 402–421 (2021)
Prieto, G., Delgado, A.R.: The effect of instructions on multiple-choice test scores. Eur. J. Psychol. Assess. 15(2), 143 (1999)
Ramsay, J., Wiberg, M., Li, J.: Full information optimal scoring. J. Edu. Behav. Stat. 45(3), 297–315 (2020)
Rasch, G.: Probabilistic Models for Some Intelligence and Attainment Tests. Danmarks Paedagogiska Institut, Copenhagen, Denmark (1960)
Slepkov, A.D., Godfrey, A.T.: Partial credit in answer-until-correct multiple-choice tests deployed in a classroom setting. Appl. Meas. Educ. 32(2), 138–150 (2019)
Traub, R.E., Hambleton, R.K.: The Effect of Scoring Instructions and Degree of Speededness on the Validity and Reliability of Multiple-Choice Tests. Edu. Psychol. Meas. 32(3), 737–758 (1972)
Traub, R.E., Hambleton, R.K., Singh, B.: Effects of promised reward and threatened penalty on performance of a multiple-choice vocabulary test. Edu. Psychol. Meas. 29(4), 847–861 (1969)
Tversky, A.: On the optimal number of alternatives at a choice point. J. Math. Psychol. 1(2), 386–391 (1964)
Vanderoost, J., Janssen, R., Eggermont, J., Callens, R., De Laet, T.: Elimination testing with adapted scoring reduces guessing and anxiety in multiple-choice assessments, but does not increase grade average in comparison with negative marking. PLoS One 13(10), e0203931 (2018)
Wilcox, R.R.: Solving measurement problems with an answer-until-correct scoring procedure. Appl. Psychol. Meas. 5(3), 399–414 (1981)
Wu, Q., De Laet, T., Janssen, R.: Modeling partial knowledge on multiple-choice items using elimination testing. J. Edu. Meas. 56(2), 391–414 (2019)
Zapechelnyuk, A.: An axiomatization of multiple-choice test scoring. Econ. Lett. 132, 24–27 (2015)
Zimmerman, D.W., Williams, R.H.: Effect of chance success due to guessing on error of measurement in multiple-choice tests. Psychol. Rep. 16(3), 1193–1196 (1965)
Zimmerman, D.W., Williams, R.H.: A new look at the influence of guessing on the reliability of multiple-choice tests. Appl. Psychol. Meas. 27(5), 357–371 (2003)
Acknowledgements
Besides the anonymous referees, I acknowledge the valuable and close reading of the manuscript by the Editor-in-Chief, Dr Marco Alfò, and the suggestions that he provided.
Funding
Open access funding provided by University of Gothenburg.
Appendices
Appendix A Derivation of Eqs (2) and (3)
1.1 A.1 Derivation of \(P_k(f,c)\) for \(k > 1\)
The probability \(P_k(f,c)\) with \(k > 1\) is directly proportional to the probability of the test-taker knowing all but k alternatives. The probability that k of the alternatives are unknown and \(c-k\) are known is \((1-f)^k f^{c-k}\) multiplied by the number of groupings with k elements among the answer options, viz. the binomial coefficient \(\binom{c}{k}\).
However, the probability that among these \(c-k\) known alternatives, none is actually the key is \((1-c^{-1})(1-(c-1)^{-1})\cdots (1-(k+1)^{-1})\). When the probability is corrected for this, we obtain Eq. (3).
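The correction factor telescopes to a simple closed form, which makes the structure of Eq. (3) transparent:

```latex
\left(1-\frac{1}{c}\right)\left(1-\frac{1}{c-1}\right)\cdots\left(1-\frac{1}{k+1}\right)
  \;=\; \prod_{j=k+1}^{c}\frac{j-1}{j}
  \;=\; \frac{k}{c},
```

so that the corrected probability for \(k>1\) reads \(P_k(f,c) = \frac{k}{c}\binom{c}{k}\,f^{c-k}(1-f)^k\).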
1.2 A.2 The case \(k = 1\)
Let us first determine the contributions to \(P_1(f,c)\) which include the cases where the key is known. Assuming i distractors are known, these probabilities are given by \(f^{i+1}(1-f)^{c-i-1}\) multiplied by the binomial coefficient \(\binom{c-1}{i}\).
In addition to these probabilities, there is the special case of all distractors being known but not the key. The associated probability is \(f^{c-1}(1-f)\). The sum of all these contributions yields Eq. (2).
The attentive reader will have noticed that Eq. (2) may be simplified to \(P_1(f,c) = f + (1-f)f^{c-1}\),
corresponding to the sum of the two probabilities that the test-taker either knows the key or, barring that, all of the distractors (which are the only two ways that she may provide the correct answer with certainty). It is also assumed throughout that the set of possible keys and distractors is infinitely larger than the actual subset of keys and distractors chosen for the test. Otherwise, the probabilities will not be independent as we have assumed.
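A minimal sketch of these probabilities, using the simplified form of \(P_1\) and, for \(k > 1\), the binomial term corrected by the telescoped factor \(k/c\); the check confirms that exactly one of the outcomes \(k = 1, \ldots, c\) (the number of answer options left after elimination) must occur.

```python
from math import comb

# P_k(f, c): probability that exactly k answer options remain after the
# test-taker has eliminated what she can. k = 1 uses the simplified form of
# Eq. (2); k > 1 uses the binomial term times the telescoped correction k/c.
def p_remaining(k, f, c):
    if k == 1:
        return f + (1 - f) * f ** (c - 1)
    return (k / c) * comb(c, k) * f ** (c - k) * (1 - f) ** k

# The probabilities over k = 1, ..., c must sum to one for any f.
for c in (2, 3, 4, 5):
    for f in (0.0, 0.3, 0.6, 1.0):
        total = sum(p_remaining(k, f, c) for k in range(1, c + 1))
        assert abs(total - 1.0) < 1e-12
print("normalization holds")
```

At \(f = 1\) all mass sits on \(k = 1\) (certain answer), and at \(f = 0\) on \(k = c\) (pure guess), as expected.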
Appendix B Relation to the Rasch latent-trait model
We here point out the mathematical relation between our formalization and the Rasch model [36]. These two approaches differ in their assumptions; our formalization makes stronger assumptions and, correspondingly, is only applicable to the type of factual recall discussed above (with no shared keys or distractors between items). On the other hand, through Eqs. (2) and (3) we can make stronger a priori predictions for this particular type of knowledge test than is possible with the more general but less committal treatment of Rasch theory, or item response theory more generally, which includes more parameters.
In the Rasch model [36], each test item is associated with an intrinsic difficulty \(\delta \), and the probability of success, P, on the item is given by \(P = e^{\theta - \delta }/(1 + e^{\theta - \delta })\),
where \(\theta \) is the inherent ability of the test-taker, related in our model to the parameter f, whereas the difficulty parameter \(\delta \) is a constant (since there is no intrinsic variation in difficulty across items in our formalization; in other words, the test items do not form a Guttman scale). In fact, we assume not only that each test item is of the same difficulty, but also that each answer option is of the same “intrinsic difficulty”. We are not alone in this assumption; it was also implicitly made by Tversky [40] in his theorem.
For dichotomous scoring, we have the function \(\theta (f)\) through equation (B2) upon substituting the expression for P(f, c) in Eq. (1) if guessing is part of the test-taker’s strategy, or the expression for \(P_1(f, c)\) in Eq. (2) in case it is not. Since P(f, c) is a monotonic function of f, \(\theta (f)\) is invertible, and so Eq. (B2) represents a coordinate transformation between f-space and \(\theta \)-space for the latent ability. One might be tempted to conclude that if, in a Rasch analysis, \(\delta \) is not found constant across the test items, the assumptions of our formalization do not hold for that particular test; but one must keep in mind that, because of measurement error alone, \(\delta \) will never be found precisely constant for any real-world data. Nevertheless, having error bounds on \(\delta \) allows us to say with some confidence whether the formalization is applicable or not.
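The transformation can be sketched by inverting the Rasch formula to \(\theta = \delta + \ln (P/(1-P))\) and substituting the no-guessing probability \(P_1(f,c) = f + (1-f)f^{c-1}\); the monotonicity check below mirrors the invertibility argument (for the dichotomous no-guessing case, as an illustration).

```python
from math import log

# Coordinate transformation between f-space and theta-space: substitute the
# no-guessing success probability P1(f, c) into the inverted Rasch formula.
def theta(f, c, delta=0.0):
    p1 = f + (1 - f) * f ** (c - 1)          # Eq. (2), simplified form
    return delta + log(p1 / (1 - p1))        # logit, inverse of the Rasch curve

# P1 is strictly increasing on (0, 1), so theta(f) is invertible there.
c = 4
grid = [i / 100 for i in range(1, 100)]
values = [theta(f, c) for f in grid]
assert all(a < b for a, b in zip(values, values[1:]))
print("theta(f) is strictly increasing")
```

The endpoints \(f = 0\) and \(f = 1\) map to \(\theta = \pm \infty \), as usual for the logit transformation.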
The preceding discussion might seem to indicate that the model is very restrictive in its domain of application. However, if \(\delta \) is not constant, one may simply group the items by difficulty and apply the analysis to each group individually, each group of items being thus considered as an individual test. One should nevertheless not make the mistake of taking the Rasch model as somehow “intrinsically” correct: indeed, for an item that tests only knowledge, it is difficult if not impossible to interpret the “difficulty” as an intrinsic property of the item (unlike, say, for items that require computations or reasoning, which quite clearly demand varying levels of “mental energy”). Without a model representation of the difficulty, it becomes simply a fitting parameter.
In fact, since the test items do not form a hierarchy of difficulty (Rasch difficulty parameters constant), the test is not unidimensional in the traditional sense of psychometrics. This means that the computed reliability coefficient is invariant to the addition of more test items within the model. Simply put, in the formalization that we apply, the N-item test has N orthogonal factors of equal prominence in factor analysis. This is a peculiarity of the mathematical model that we employ, and it can be interpreted as the model of a “pure” achievement test (as opposed to “ability tests” which measure some underlying mental ability).
Persson, R.A.X. Theoretical evaluation of partial credit scoring of the multiple-choice test item. METRON 81, 143–161 (2023). https://doi.org/10.1007/s40300-022-00237-w