1 Introduction

The multiple-choice test item is objective in the sense that the correct response is unambiguously defined; but since the correct answer is “hidden in plain view”, the construction of this item is more complicated than that of the so-called “open question” or constructed-response item, where the test-taker produces her response unaided rather than selects one among preconstructed ones. However, since the extra complexity in item construction relative to the constructed-response item is independent of the number of test-takers, the ease of scoring and administering multiple-choice items leads to great savings when the number of test-takers is large.

By “multiple-choice item” we mean the special case of the “selected-response item” where the test-taker knows that, of the many options presented, only one is correct (the incorrect options are known as “distractors” and the correct one as the “key”). This in itself is not sufficient to define the item uniquely, as there are many different ways of assigning scores among all possible responses: for instance, some distractors might be more “wrong” than others, or the response may consist of two selected options rather than one, and so on. The prototypical scoring function assigns credit only for the correct response, defined as the test-taker’s correct and unambiguous designation of the correct answer option. However, if the test-taker is rational and seeks to maximize her expected score, scores under this scoring function suffer from confounding variance due to guessing [1,2,46].

Whereas computerized forms of multiple-choice testing where the answer options are presented one-at-a-time may reduce the test-taker’s propensity for guessing [33], guessing remains a rational strategy and this solution is in any case impossible to implement in pencil-and-paper forms; this applies also to the variation where the test-taker is allowed “answers until correct” [42]. A direct solution to the problem is merely to increase the test length with more test items, but since item construction is cumbersome, it is preferable to solve the problem by changing the scoring function.

The most straightforward solution is then to subtract points for incorrect answers, a procedure sometimes called “formula scoring” [27, 9]. This provides strong incentive against guessing but may entail a perceived problem of “fairness” in the test-taker population. Additional variance due to differences in risk-taking behavior is also introduced [15], and an “intimidation factor”, detrimental to the test validity, associated with this type of scoring has been reported [7, 5]. However, this latter effect can be psychologically mitigated, if not eliminated, by providing points for skipped questions instead of subtracting points for incorrect answers [39, 34].

It has also been suggested [13, 3, 1, 11, 8, 44, 32, 20, 18, 22, 37, among others] that the space of possible responses to a multiple-choice test item can be scored in a much more nuanced way by assigning “partial credit” to test-takers who indicate correctly that they know some options are wrong, rather than hazard a guess at the right answer. The added complexity of choice does not seem to pose any significant problems for the test-takers [4]. Besides the greater discriminatory power that the test is thereby supposed to achieve by extending the effective scoring range, partial credit is also an instrument for penalizing blind guessing by rewarding the expression of “partial knowledge”. The influence of such “partial knowledge” on the solution of the multiple-choice test item is apparent, for instance, in the study by Medawela et al. [30] on dentistry students, where parallel forms of multiple-choice and fill-in-the-blanks tests were used.

A non-technical review of a small subset of the above-cited scoring functions is presented by Lesage et al. [26] and, in a more technical vein, computer simulations aiming to establish the effect of different scoring functions on the test reliability are reported by Chica and Tárrago [12] and Frary [21]. The fundamental problem is the arbitrariness inherent in how partial credit is awarded for a response. While some authors derive scoring functions from different axioms [44, 20], this approach ignores the question of the actual validity of the test and may lead to unintended results, no matter how simple or appealing the underlying axioms. In general, validating tests and their scoring functions against real-world performance is a time-consuming and expensive endeavor. For instance, the results of Vanderoost et al. [41] are limited to the particular partial scoring model – the one due to Arnold and Arnold [3] – evaluated in their study.

In light of these considerations, we present a formalization of the probability of solution of a multiple-choice test item as a function of the fraction of the material that the test-taker is presumed to know, and use it to investigate theoretically the characteristics of a number of different scoring functions for the rational test-taker striving to maximize her score. We are then in a position to evaluate which, if any, scoring function is superior with respect to different statistics, and also to draw some general conclusions regarding different scoring functions.

2 Latent-score model

2.1 Partial knowledge

Throughout this paper, we adopt an axiomatic approach where the latent score is to be defined transparently a priori rather than extracted statistically a posteriori. This means that we seek a definition of “partial knowledge” that can be evaluated independently of the multiple-choice test and which is intuitively obvious. The axiomatic approach further implies that we can only provide justification for our definition by way of examples.

Consider now the following hypothetical test item:

Which is the French word for cat?

A. chien B. chat C. choux D. chouette

A test-taker who knows the meaning of any of the distractors (chien, choux, chouette) can readily rule them out. This we take to be a reflection of the test-taker’s “partial knowledge”. If she honestly indicates, for instance, that “choux” and “chouette” are not the correct alternative, she has provided evidence that she knows the English meaning of these two words (for how else could she know they are not correct?). If asked, in a parallel form of the test with constructed responses, to give the English meaning of each of these four alternatives, she would succeed at two and fail at the others.

Consider now another example (based on one given in [29]):

Which is the capital of Spain?

A. Milano B. Lisbon C. Madrid D. Barcelona

If the test-taker knows that Milano is in Italy, or that Barcelona is in Spain but does not hold the status of national capital, then she could rule these out, thereby making use of her “partial knowledge”. To capture this knowledge in a constructed-response parallel form, the test-taker could be asked for each of these cities in turn to provide their country of location and status (capital or not). From her answers, it should be clear whether her knowledge is sufficient to rule them out in this example item. In fact, simply asking the test-taker a series of open questions of the form “tell us about X”, where X is an answer option, should suffice.

At this point, we have to consider another type of multiple-choice test item, for which the concept of “partial knowledge” appears ill-defined, for instance, an item like:

Which of the below options is a synonym of “tenebrous”?

A. happy B. sad C. bright D. dark

Quite clearly, the test-taker might know the meaning of each of the four alternatives and still not know the meaning of “tenebrous” used in the item stem. But if the item is reformulated, for instance, by asking “Pick the two words below that are synonyms”, partial knowledge can be used to the test-taker’s advantage by the method of exclusion. A parallel form with constructed responses for this last variant of the test would ask for synonyms of each of the alternatives in turn.

From these examples, we propose the following definition.

Definition 1

For tests of vocabulary or, more generally, factual knowledge, provided that all of the terms in the stem of the multiple-choice test item are known and understood, the “latent score” on the item can be ascertained by the number of correct responses on a parallel constructed-response form where the test-taker is quizzed in isolation on the meaning or implications of each of the answer options, if such a parallel form can be constructed.

Note that we do not necessarily assume that the chance of correctly guessing on a constructed-response test item is zero. If the test-taker is sufficiently knowledgeable to narrow down the possible answers (without any cues) to such an extent that guessing correctly becomes likely, then a correct answer is simply counted as evidence of her knowledge. This does not affect our formalization of the test item solution to be presented below, as she would be able to use the same reasoning to rule out distractors, but illustrates operationally where we imagine the line between “known” and “unknown” is drawn in terms of certainty.

The distinction between “known” and “unknown” status of any answer option is meaningful for the test of factual knowledge, but not for tests of reasoning where the solution requires progression through several steps. For instance, in a test of mathematics, a test-taker may only be able to arrive at a partial solution of the problem; however, the rational test-taker that is aware that she has not solved the problem completely will exclude any answer option corresponding to this partial solution, if such an option is present, and then guess the correct answer among the remaining ones. This leads to an inversion of the credits model we consider and test items of this type will not be considered in this paper.

In the most general case that we consider, the test-taker is expected to answer, with full credit, all items correctly for which she knows either the meaning (and implications) of the key, or for which she knows the same for all of the distractors; and to answer, with partial credit, all other items for which she has only partial knowledge of the answer options. Keeping a general approach, by “partial credit” we consider also “no credit” as a special case, as long as the answer is not completely erroneous (i. e., the correct option not indicated at all). Nevertheless, it should be clear – and we will return to this point in a subsequent section – that our definition of the latent score (that is, as determined by the parallel constructed-response form of the multiple-choice test) cannot map perfectly to any partial credit scoring function for the multiple-choice test item. For instance, providing the correct answer to an item can be the result of luck, of knowing the key but no distractors, of knowing all of the distractors but not the key, or of knowing all distractors and key.

We shall consider an item to be correctly answered if only the key is indicated; if the key is indicated in addition to any number of distractors, the answer is only partially correct. If only distractors are indicated as correct, we take the answer to be incorrect. We take the blank answer to be partially correct (since it is equivalent to indicating all answer options). When necessary, we will also make a distinction between the “true score” and the “Platonic true score” [24]. The “true score” is in theory the average value of the observed score on infinitely repeated administrations of the same item and it may include effects of chance, whereas the “Platonic true score” is the actual score of the test-taker that is free of guessing.

2.2 Formalization

Since we are only interested in test statistics, the actual content is immaterial and the model can be readily formalized as follows. First, we consider that an answer alternative is “known” if the test-taker is “certain” (in the sense that she would answer correctly on a parallel constructed-response form) that it is either key or distractor, without needing to know the other alternatives. An alternative is “unknown” if the test-taker is “uncertain” about its status (we do not distinguish between different levels of “uncertainty” here). Second, we define a hypothetical “test” by assuming it contains a fixed number of items with c alternatives each, all of which are unique and only appear once in the whole test. Third, we consider a set of “test-takers”, each of whom knows a fixed but individual number of concepts drawn randomly from the keys and distractors of the hypothetical test. We may thus consider an arbitrary number of test-takers taking any number of arbitrarily designed tests using random distributions to mimic real-world test-taking populations of any desired statistical characteristics.

Let us denote the probability that the test-taker who masters a fraction f of the material knows the answer to an item with c options with complete certainty as \(P_1(f, c)\). Likewise, we denote the probability that she knows with absolute certainty the key to be one of two options, but not which one, as \(P_2(f, c)\), and so on. We assume that there is no statistical difference between the probability of knowing a key or a distractor (which means that the test-taker is expected to know, on average, the same fraction f of all keys and distractors on the test if queried on a parallel form, which is our definition of the latent score). In other words, we assume that the completely ignorant test-taker (with \(f = 0\)) would assign equal probabilities to each of the alternatives of a test item being correct. The total probability of providing the correct answer, for a randomly chosen pair of rational test-taker and item, is [23, rewritten here in differential form]

$$\begin{aligned} P(f, c) = P_1(f,c) + \frac{P_2(f,c)}{2} + \ldots + \frac{P_c(f,c)}{c} \end{aligned}$$
(1)

where each term corresponds in turn to the probability of guessing the correct option among \(1, 2, \ldots , c\) alternatives, having excluded \(c-1, c-2, \ldots , 0\) of the answer options by partial knowledge. In the Appendix, we show that

$$\begin{aligned} P_1(f,c) = \sum _{i=0}^{c-1} \left( \begin{array}{c} c - 1 \\ i \end{array} \right) f^{i+1}(1-f)^{c-i-1} + f^{c-1}(1-f) \end{aligned}$$
(2)

and

$$\begin{aligned} P_k(f,c) = \left( \begin{array}{c} c - 1 \\ k - 1 \end{array} \right) f^{c-k}(1-f)^{k} \end{aligned}$$
(3)

for \(1< k < c\), and \(P_c(f,c) = (1-f)^c\).
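As an illustration of the formalization, the following Python sketch (our own illustration; all names and parameter values are hypothetical) estimates \(P_1, \ldots , P_c\) by direct Monte Carlo simulation of the model described above, in which each key and distractor is independently “known” with probability f, and compares the estimates with the closed forms of Eqs. (2) and (3) and their combination in Eq. (1).

```python
import random
from math import comb

def P_closed(k, f, c):
    """P_k(f, c) from Eqs. (2) and (3): probability that the test-taker can
    narrow the item down to exactly k of its c options."""
    if k == 1:                       # key known, or all c - 1 distractors known
        return f + f**(c - 1) * (1.0 - f)
    # key unknown, exactly c - k of the c - 1 distractors known
    return comb(c - 1, k - 1) * f**(c - k) * (1.0 - f)**k

def P_monte_carlo(f, c, trials=200_000, seed=1):
    """Estimate P_1, ..., P_c by simulating per-option knowledge directly."""
    rng = random.Random(seed)
    counts = [0] * (c + 1)
    for _ in range(trials):
        key_known = rng.random() < f
        distractors_known = sum(rng.random() < f for _ in range(c - 1))
        # number of options left after ruling out the known distractors
        remaining = 1 if key_known else c - distractors_known
        counts[remaining] += 1
    return [counts[k] / trials for k in range(1, c + 1)]

if __name__ == "__main__":
    f, c = 0.6, 4
    mc = P_monte_carlo(f, c)
    for k in range(1, c + 1):
        print(f"P_{k}: closed form {P_closed(k, f, c):.4f}, Monte Carlo {mc[k - 1]:.4f}")
    # Eq. (1): total probability of a correct answer when guessing uniformly
    print("P(f, c) =", sum(P_closed(k, f, c) / k for k in range(1, c + 1)))
```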

3 Error-minimizing scoring function

For clarity of exposition, we shall for now leave the probabilistic description of the test-taker’s knowledge, quantified by the parameter f, and briefly turn to a deterministic description of the test-taker’s knowledge, quantified by the number of unknown answer options, i. We will return to the probabilistic description in Sect. 3.2.

Let \(p_n\) denote the score awarded for indicating n answer options, of which one is correct, in a multiple-choice test item. In other words, \(p_n, n > 1\) designates the point value of a partially correct answer and \(p_1\) denotes the point value of a fully correct one. If there are i unknown alternatives (one key and \(i-1\) distractors), the expected value of the score when randomly guessing n answer options among them is then

$$\begin{aligned} S(i, n) = \frac{\left( \begin{array}{c} i-1 \\ n-1 \end{array} \right) }{\left( \begin{array}{c} i \\ n \end{array} \right) } p_n \end{aligned}$$
(4)

since there are \(\left( \begin{array}{c} i \\ n \end{array} \right) \) different ways of indicating an answer comprising n among the i unknown options but only \(\left( \begin{array}{c} i - 1 \\ n - 1 \end{array} \right) \) different ways of indicating answers which all contain the key. The ratio of these two quantities, which simplifies to n/i, is the probability that the key is among the n options indicated in the answer; such an answer merits a score of \(p_n\), whereas the incorrect answer is, for simplicity, assumed to be worth zero points.

Likewise, the score variance when guessing n answer options is given by,

$$\begin{aligned} V(i, n) = \left( \frac{\left( \begin{array}{c} i - 1 \\ n - 1 \end{array} \right) }{\left( \begin{array}{c} i \\ n \end{array} \right) } - \frac{\left( \begin{array}{c} i - 1\\ n - 1 \end{array} \right) ^2}{\left( \begin{array}{c} i \\ n \end{array} \right) ^2} \right) p_n^2 \end{aligned}$$
(5)

The idea is to choose the set of coefficients \(\{p_j\}_{j=1}^c\) to minimize the variance whenever the test-taker tries to maximize the expected value. We assume that the rational test-taker will exclude answer options that she knows to be wrong (effectively decreasing the value of i), and then provide an answer comprising n of the remaining options.

3.1 Necessary and sufficient conditions to suppress guessing

The strategic choices that the test-taker makes will depend on her knowledge, which is unknown to the test-maker at the time of test construction. Assume that faced with a question item, the test-taker uses her knowledge to narrow the feasible options down to i alternatives. We want to ensure that she provides a partial answer comprising not more and not fewer than these i alternatives, since otherwise an element of chance unrelated to her knowledge is introduced in the score (increasing the test-retest variance).

The deterministic function S(i, n) gives the expected score when indicating by guesswork n options among i unknown as correct and we want to ensure that the rational strategy is to mark all of these i options and not only gamble on a subset of, say, \(m < i\) options. We also want to ensure that the rational strategy precludes inclusion of options that the test-taker knows to be wrong, which would lead to a scoring function with some particularly undesirable properties. Therefore, we require that \(S(i, i) > S(i, m)\) for \(m < i\) (penalizing the gamble of answering more narrowly than warranted by partial knowledge) as well as \(S(i,i) > S(i+1,i+1)\) for \(i<c\) (penalizing the “hedging-your-bets” strategy of answering more inclusively than warranted).

The first inequality leads to the condition:

$$\begin{aligned} S(i, i)> S(i, i - 1)> \ldots > S(i, 1) \end{aligned}$$
(6)

A natural way to satisfy it is to write the recursive relation:

$$\begin{aligned} p_i = \frac{(i - 1)p_{i-1}}{i} + \epsilon _i,\ \textrm{for}\ i = 1, \ldots , c \end{aligned}$$
(7)

where \(\{\epsilon _i\}\) are positive constants. This recursive relation can be rewritten in closed form as,

$$\begin{aligned} p_k = \sum _{j=1}^k \frac{j \epsilon _j}{k}. \end{aligned}$$
(8)

The second inequality then leads to the requirement,

$$\begin{aligned} \frac{k + 1}{k} \sum _{j=1}^k j \epsilon _j > \sum _{j=1}^{k+1} j \epsilon _j \end{aligned}$$
(9)

where \(c > k \ge 1\). This imposes an upper limit on the values of \(\{\epsilon _j\}\); however, in practice, we will consider only small values and will not need to pay any explicit heed to this constraint.
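To make the construction concrete, the following sketch (a minimal illustration; the particular values of \(\epsilon _j\) are hypothetical) builds a score vector from Eq. (8) and checks numerically that the resulting S(i, n) of Eq. (4) respects inequality (6), as well as the condition \(p_i > p_{i+1}\) implied by inequality (9).

```python
from math import comb

def scores_from_eps(eps):
    """Eq. (8): p_k = (1/k) * sum_{j <= k} j * eps_j, with eps = [eps_1, ..., eps_c]."""
    return [sum((j + 1) * eps[j] for j in range(k + 1)) / (k + 1) for k in range(len(eps))]

def S(i, n, p):
    """Eq. (4): expected score when guessing n options among i unknown ones."""
    return comb(i - 1, n - 1) / comb(i, n) * p[n - 1]

if __name__ == "__main__":
    c = 4
    eps = [1.0, 0.05, 0.05, 0.05]      # hypothetical positive increments, small enough for Eq. (9)
    p = scores_from_eps(eps)
    print("p =", [round(x, 4) for x in p])
    # Inequality (6): answering all i unknown options beats any narrower gamble
    assert all(S(i, i, p) > S(i, m, p) for i in range(2, c + 1) for m in range(1, i))
    # Inequality (9) is equivalent to p_i > p_{i+1}: no padding with known distractors
    assert all(p[i] > p[i + 1] for i in range(c - 1))
    print("Inequalities (6) and (9) are satisfied for this choice of eps.")
```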

Ideally, the test-retest variance due to guessing is eliminated for the rational or risk-averse test-taker by any scoring function for which the expected gain of a guess beyond the test-taker’s knowledge is negative, and inequalities (6) and (9) provide this condition. However, in the actual testing situation, the only variance observed is that among the test-takers, and this variance is not due solely to random guessing, since it also includes differences in measured ability in the test-taker population. The latter variance should be conserved in the test situation. We shall therefore narrow down our choice for \(p_i\) further.

3.2 Relation between item and latent score

Assume a test-taker knows a fraction f of the material (distractors and keys; f is directly proportional to the extent of the test-taker’s factual knowledge in the domain tested). The expected value of the score that she will get for an item, relative to a blank answer for which she is awarded \(p_c\) points, is then \(\sum _{i=1}^{c-1} (p_i - p_c) P_i(f, c)\) but in order for the points awarded to reflect the expected knowledge, this sum should equal f (or, at any rate, f times a constant). We thus have an equation,

$$\begin{aligned} \sum _{i=1}^{c-1} (p_i - p_c) P_i(f, c) = f \end{aligned}$$
(10)

Since \(P_i(f, c)\) is a polynomial of degree c, one obtains from this equation a linear system of c equations and \(c-1\) unknowns by comparing coefficients in f. Since the system is overdetermined, we conclude that a perfectly linear correlation between the expected test score and our model for the latent score is unattainable for a multiple-choice test. This is the conclusion we drew in a preceding section, repeated here on more formal grounds.

To make headway, we pursue an alternative (albeit approximate) approach in which we consider that the test-taker knows a fraction g/c of the answer options for an item, where \(g = 1,\ldots , c - 1\), and then compute the expected score relative to the blank answer under the following assumption: if the key is known (which it is with probability g/c), the test-taker answers correctly for \(p_1\) points; otherwise, she answers partially correctly for \(p_{c-g}\) points. In this case, requiring that the expected score with respect to the blank answer equals the knowledge possessed, one obtains a linear system of \(c - 1\) equations of the form,

$$\begin{aligned} \frac{c - g}{c} (p_{c-g} - p_c) + \frac{g}{c} (p_1 - p_c) = \frac{g}{c} \end{aligned}$$
(11)

If one then sets \(p_c = 1/c\), the solution becomes the scoring function of Zapechelnyuk [44] – derived independently, apparently, by Otoyo and Bush [32] on heuristic grounds – where in our notation \(\epsilon _i = 0\) for \(i > 1\) and \(\epsilon _1 = 1\). The fact that \(\epsilon _i = 0\) for \(i > 1\) implies a weak violation of inequality (6), meaning both that the risk-neutral test-taker may guess (increased random variance) and that the latent score of the risk-averse test-taker may be underestimated.

Since keeping the relation of the score with the fraction of knowledge possessed as linear as possible is desirable, one might inquire about other choices of \(p_c\). One can easily verify that values of \(p_c < 1 / c\) will violate the condition that \(\epsilon _i > 0\) even further, but values of \(p_c > 1/c\) lead to compliance (\(\epsilon _i > 0\)) as long as the upper limit for \(p_c\) implied by inequality (9) is respected. Moreover, the point difference with respect to the blank answer remains the same, but the relative point gaps between consecutive types of answers change. Thus, in reality, the precise value of \(p_c > 1/c\) will be dictated by the psychology of the test-takers (e. g., degree of risk aversion or risk seeking).
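The solution of the linear system implicit in Eq. (11) can be written down explicitly: solving the \(g = c - 1\) equation first gives \(p_1 = p_c + (c-1)/c\), and back-substitution into the remaining equations gives \(p_k = p_c + (c-k)/(ck)\), i.e., \(p_k = 1/k + (p_c - 1/c)\). The short sketch below (an exact-fraction illustration with the hypothetical choice \(p_c = 7/20\)) reproduces the Zapechelnyuk scores \(1, 1/2, \ldots , 1/c\) for \(p_c = 1/c\) and shows how a larger \(p_c\) shifts every \(p_k\) by the same constant.

```python
from fractions import Fraction

def mz_scores(c, p_c):
    """Solve Eq. (11) for p_1, ..., p_{c-1} given the blank-answer score p_c.

    The g = c - 1 equation gives p_1 = p_c + (c - 1)/c; back-substitution into the
    remaining equations gives p_k = p_c + (c - k)/(c*k), i.e. p_k = 1/k + (p_c - 1/c).
    """
    p_c = Fraction(p_c)
    p = {c: p_c, 1: p_c + Fraction(c - 1, c)}
    for g in range(1, c - 1):                   # remaining equations, with k = c - g
        k = c - g
        p[k] = p_c + Fraction(g, c * k)
    return [p[k] for k in range(1, c + 1)]

if __name__ == "__main__":
    c = 4
    zap = mz_scores(c, Fraction(1, c))          # Zapechelnyuk: 1, 1/2, 1/3, 1/4
    shifted = mz_scores(c, Fraction(7, 20))     # hypothetical p_c = 7/20 > 1/c
    print("p_c = 1/c :", [str(x) for x in zap])
    print("p_c = 7/20:", [str(x) for x in shifted])
```

With this solution, the increments of Eq. (7) are \(\epsilon _1 = p_1\) and \(\epsilon _k = p_c - 1/c\) for \(k > 1\), which makes the compliance condition \(\epsilon _k > 0\) for \(p_c > 1/c\) immediate.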

4 Comparison of scoring functions for test items

Having established our formalization of the rational test-taker and her latent score, we shall investigate different scoring functions for the test items for purposes of illustration and comparison. These different functions correspond to different scores awarded for different types of answers to one and the same item, the type of answer in turn being dictated by the rational test-taker that we consider in our formalization. For brevity, we will not consider every example which can be found in the literature, even if the mathematical approach is general enough. In the following, we let m denote the number of distractors known to the test-taker and c the total number of answer options on the multiple-choice test item. Nota bene, we reuse the symbol S from the previous section to denote the scoring function, but will consider it as a stochastic function of c and f, rather than as a deterministic function of i and n.

In the first scoring function considered, corresponding to the typical “number correct” scheme, the test-taker is awarded \(p_1\) points if either the key or all of the distractors are known. If not, \(p_1\) points are awarded with a probability of \(1/(c-m)\). This situation corresponds to the rational test-taker guessing whenever in doubt, and answering correctly whenever certain. In our formalization, the first two moments of this score function are

$$\begin{aligned} E[S_{\textrm{NC}}(f, c)] = p_1 \sum _{i=1}^c \frac{P_i(f, c)}{i} \end{aligned}$$
(12)

and

$$\begin{aligned} E[S_{\textrm{NC}}(f, c)^2] = p_1^2 \sum _{i=1}^c \frac{P_i(f, c)}{i} \end{aligned}$$
(13)

where “NC” stands for “number correct”.

In the second scoring function we consider, corresponding to a modified Zapechelnyuk (MZ) scoring function (with \(p_c > 1/c\)), points are awarded according to the following procedure:

  • If either the key or all of the distractors are known, the test-taker is awarded \(p_1\) points.

  • If the key is not known, the test-taker is awarded \(p_{c-m}\) points.

The set \(\{p_i\}\) is defined by solving the system of equations implicit in Eq. (11) with \(p_c > 1/c\). The precise value of \(p_c\) has no effect on the behavior of our formalized, rational test-takers as long as it is greater than 1/c. In formulae, we have

$$\begin{aligned} E[S_{\textrm{MZ}}(f, c)] = \sum _{i=1}^c p_i P_i(f, c) \end{aligned}$$
(14)

and

$$\begin{aligned} E[S_{\textrm{MZ}}(f, c)^2] = \sum _{i=1}^c p_i^2 P_i(f, c) \end{aligned}$$
(15)

for the first and second moments, respectively.

In the third one, proposed by Frandsen and Schwartzbach [20], \(p_1 \ln (c)\) points are given if either the key or all of the distractors are known, and \(p_1 \ln (c/(c-m))\) points are given otherwise. In the original formulation, there is a variable point penalty for incorrect answers designed to nullify the expected score of guessing. We do not need to consider it explicitly here because the test-takers we model are not risk-seeking. Mathematically, we have

$$\begin{aligned} E[S_{\textrm{FS}}(f, c)] = p_1 \sum _{i=1}^{c-1} \ln \left( \frac{c}{i} \right) P_i(f, c) \end{aligned}$$
(16)

and

$$\begin{aligned} E[S_{\textrm{FS}}(f, c)^2] = p_1^2 \sum _{i=1}^{c-1} \ln \left( \frac{c}{i} \right) ^2 P_i(f, c) \end{aligned}$$
(17)

The subscript “FS” is a mnemonic for “Frandsen-Schwartzbach”.

In the fourth one, corresponding to the popular “subset selection” (SS) scoring first proposed by Dressel and Schmid [18], \(p_1\) points are awarded if the test-taker knows the key or all of the distractors, and otherwise \(p_1(1-(c-m-1)/(c-1))\) points are awarded. It must be pointed out that this scoring function, as formulated in the original reference, strictly violates inequality (6) at several points. It is therefore implicitly assumed that a penalty for incorrect answers is also included in the scoring function so as to negate the expected value of all guesses. Our formalization gives the moments,

$$\begin{aligned} E[S_{\textrm{SS}}(f, c)] = p_1 \sum _{i=1}^{c-1} \left( 1 - \frac{i - 1}{c - 1} \right) P_i \end{aligned}$$
(18)

and

$$\begin{aligned} E[S_{\textrm{SS}}(f, c)^2] = p_1^2 \sum _{i=1}^{c-1} \left( 1 - \frac{i - 1}{c - 1} \right) ^2 P_i \end{aligned}$$
(19)

under these slightly modified rules.

Finally, in the last one, \(p_1\) points are awarded for the correct answer, and a “very large” number is subtracted as a penalty for providing the wrong answer, meaning in our case that the test-taker will answer only if she knows the answer with complete certainty and otherwise leaves it blank for no points. For our purposes, we do not need to specify exactly how large this penalty is, but it is chosen at least large enough to make the expected score of a random guess between only two options negative. This gives very simple expressions for the first and second moments,

$$\begin{aligned} E[S_{\textrm{NG}}(f, c)] = p_1 P_1(f, c) \end{aligned}$$
(20)

and

$$\begin{aligned} E[S_{\textrm{NG}}(f, c)^2] = p_1^2 P_1(f, c) \end{aligned}$$
(21)

respectively. The subscript “NG” stands for “no guessing”.
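The moment formulas above are straightforward to evaluate numerically. The sketch below (Python; the normalization \(p_1 = 1\) for the NC, NG, FS and SS rules and the choice \(p_c = 0.30 > 1/c\) for the MZ rule are hypothetical conveniences, not prescriptions from the text) collects Eqs. (12) to (21) in one place for a single item at fixed ability f.

```python
from math import comb, log

def P(i, f, c):
    """P_i(f, c): probability of narrowing an item with c options down to i (Sect. 2.2)."""
    if i == 1:
        return f + f**(c - 1) * (1.0 - f)
    return comb(c - 1, i - 1) * f**(c - i) * (1.0 - f)**i

def moments(f, c, scheme, p_c=0.30):
    """First and second moments of the item score, Eqs. (12) to (21)."""
    if scheme == "NC":                 # dichotomous; guess among the i remaining options
        m = sum(P(i, f, c) / i for i in range(1, c + 1))
        return m, m                    # the score is 0 or 1 (p_1 = 1), so E[S^2] = E[S]
    if scheme == "NG":                 # answer only when certain, otherwise leave blank
        return P(1, f, c), P(1, f, c)
    if scheme == "MZ":                 # p_i = 1/i + (p_c - 1/c), from Eq. (11)
        credit = lambda i: 1.0 / i + (p_c - 1.0 / c)
    elif scheme == "FS":               # logarithmic partial credit
        credit = lambda i: log(c / i)
    elif scheme == "SS":               # subset selection, guessing penalties assumed
        credit = lambda i: 1.0 - (i - 1) / (c - 1)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    e1 = sum(credit(i) * P(i, f, c) for i in range(1, c + 1))
    e2 = sum(credit(i) ** 2 * P(i, f, c) for i in range(1, c + 1))
    return e1, e2

if __name__ == "__main__":
    f, c = 0.6, 4
    for scheme in ("NC", "MZ", "FS", "SS", "NG"):
        e1, e2 = moments(f, c, scheme)
        print(f"{scheme}: E[S] = {e1:.4f}, Var[S] = {e2 - e1 ** 2:.4f}")
```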

A test is composed of several independent test items, which we consider to be drawn randomly from a set of keys and distractors. We denote the score on an item by S, and the total score is then simply the sum of the scores on the individual items (and the total variance is the sum of the individual variances). In what follows, we deal with a test composed of a single item, for simplicity but without any loss of generality. Results reported for a fixed value of f can be interpreted either as an average over all items on an infinite test for a single test-taker of specific ability, or as the average over an infinite number of test-takers of fixed ability on a fixed item.

4.1 Validity

It is usually desirable to have a scoring function that gives as linear as possible a relation with the latent score in order to enhance and facilitate comparisons between test-takers. It also means that the observed score (with respect to the underlying ability) is given on an interval, as opposed to ordinal, scale.

Therefore, we take deviations from the perfectly linear relation between the scoring function and the latent score to measure the extent of the “invalidity” of the observed score; contrariwise, the scoring function exhibits high validity if this relation is perfectly linear. In other words, we interpret a scoring function to be “valid” if on average it predicts the f-score linearly, no matter how large the dispersion around this prediction (which we take to be captured by the reliability and the measurement precision).

We will consider two measures of this linearity, since we know already that it will be compromised for values of f close to unity (although to different extents for different scoring functions). Our first index is the linear correlation coefficient, which is population-independent; we also introduce a second one, a coefficient of validity that corresponds closely in mathematical form to the reliability coefficient, both being computed from the observed variances in the test-taker population.

4.1.1 Linear correlation with latent score

Whereas the rank correlation between E(S) and f is unity for all of the considered scoring functions (ensuring they are valid for rank sorting and constitute at least a true ordinal scale), the linear Pearson correlation coefficient differs slightly between them. Keeping \(c = 4\) as our test case, the calculated results are reported in Table 1. In all cases, the greatest deviations from linearity are observed for f close to unity (not shown), which is expected because higher-order polynomial terms in Eq. (10) become important only at large f, being naturally suppressed for small f.
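For concreteness, correlations of this kind can be evaluated along the following lines; in this sketch we assume that the population-independent coefficient is computed between \(E[S(f, c)]\) and f over a uniform grid of f on [0, 1], which need not coincide exactly with the procedure behind Table 1, and only two of the rules are shown.

```python
import numpy as np
from math import comb, log

def P(i, f, c):
    """P_i(f, c), as in Sect. 2.2."""
    if i == 1:
        return f + f**(c - 1) * (1.0 - f)
    return comb(c - 1, i - 1) * f**(c - i) * (1.0 - f)**i

def expected_score(f, c, credit):
    """E[S(f, c)] for a rule awarding credit(i) when i options remain un-ruled-out."""
    return sum(credit(i) * P(i, f, c) for i in range(1, c + 1))

if __name__ == "__main__":
    c = 4
    credits = {
        "NG": lambda i: 1.0 if i == 1 else 0.0,    # Eq. (20), p_1 = 1
        "FS": lambda i: log(c / i),                # Eq. (16), p_1 = 1
    }
    fs = np.linspace(0.0, 1.0, 501)
    for name, credit in credits.items():
        es = np.array([expected_score(f, c, credit) for f in fs])
        print(f"{name}: Pearson r between E[S] and f = {np.corrcoef(fs, es)[0, 1]:.4f}")
```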

Table 1 Calculated Pearson product-moment correlation coefficients between observed and latent scores rounded to four decimal places for the different scoring functions with \(c = 4\)

4.1.2 Coefficient of validity

Here we introduce a coefficient of validity along the same lines as the coefficient of reliability in the next section, that is, one which is computed from population variances.

The variance in S on an item among test-takers of fixed ability f is given by

$$\begin{aligned} V[S(f,c)] = E[S(f, c)^2] - E[S(f, c)]^2 \end{aligned}$$
(22)

where \(E(\cdot )\) denotes an expectation value. V[S(f, c)] is the variance for all test-takers of ability f, which is independent of the actual distribution of f. To obtain the population-averaged variance for all abilities, we integrate over the latent-score distribution,

$$\begin{aligned} \Sigma ^2(c) = \int _{0}^1 V[S(f,c)] \phi (f) \textrm{d}f \end{aligned}$$
(23)

where \(\phi (f)\) is the probability density function for f in the test-taker population. In the language of classical test theory, this variance is “error variance” (and not “true score” variance; vide infra) since the integrated variance stems from an individual score variance, V[S(f, c)], that is non-zero even in a hypothetical population with no variance in the ability, f. In addition, we define the expected error as

$$\begin{aligned} E[R(f, c)] = E[S(f, c) - S(0, c)] - f E [S(1, c) - S(0, c)] \end{aligned}$$
(24)

which can be seen to vanish for all f only if \(E[S(f, c)]\) is linear in f, that is, if \(E[S(f, c)] - E[S(0, c)] = f \left( E[S(1, c)] - E[S(0, c)] \right) \), and compute the variance of this expected error as

$$\begin{aligned} \sigma _\textrm{E}^2(c) = \int _0^1 E[ R(f, c) ]^2 \phi (f) \textrm{d}f - \left[ \int _0^1 E[R(f, c)] \phi (f) \textrm{d}f \right] ^2 \end{aligned}$$
(25)

across f for the different scoring functions as per above. The variance \(\sigma _{\textrm{E}}^2\) represents the “Platonic true score” error variance, in that it is the variance of the deviation of the expected score from the value linearly predicted by the underlying ability. We now define a validity coefficient as the proportion of the total error that is not “Platonic true score” error variance, i.e.,

$$\begin{aligned} \rho (c) = 1 - \frac{\sigma _{\textrm{E}}^2(c)}{ \Sigma ^2(c) + \sigma _\textrm{E}^2(c)} \end{aligned}$$
(26)

This coefficient is bounded between zero and unity and attains its maximum when the prediction by the expectation never deviates from linearity. Conversely, it attains its minimum if there is no statistical uncertainty around erroneous predictions. This behavior agrees with the verbal definition given in Sect. 4.1.

Both variances are functions of the chosen f-distribution and to give arbitrary but clear indications of the effects of the different scoring functions, we consider a test with \(c=4\) and two different choices for \(\phi (f)\): one “broad” distribution (Distribution I), which we take to be the uniform distribution for \(f \in [0, 1]\), and one “narrow” distribution (Distribution II), which we take to be the normal distribution with mean \(E(f) = 0.6\) and standard deviation \(\sigma _f = 0.1\). Refer to Table 2 for the results. In general, the computed validity is smaller for Distribution I than for Distribution II and this decline in the accuracy (which is, however, not that substantial) is mainly a consequence of sampling the ability distribution for f close to unity.
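A minimal sketch of the quadrature behind these coefficients is given below for the MZ rule; the densities mirror Distributions I and II, where we simply truncate the normal density to [0, 1] (an implementation assumption), and \(p_c = 0.30\) is again a hypothetical choice. The other rules follow by swapping in their moment expressions.

```python
import numpy as np
from math import comb

def P(i, f, c):
    if i == 1:
        return f + f**(c - 1) * (1.0 - f)
    return comb(c - 1, i - 1) * f**(c - i) * (1.0 - f)**i

def mz_moments(f, c, p_c=0.30):
    """E[S] and E[S^2] for the MZ rule, Eqs. (14) and (15)."""
    credit = [1.0 / i + (p_c - 1.0 / c) for i in range(1, c + 1)]
    e1 = sum(credit[i - 1] * P(i, f, c) for i in range(1, c + 1))
    e2 = sum(credit[i - 1] ** 2 * P(i, f, c) for i in range(1, c + 1))
    return e1, e2

def validity(moments, c, phi, fs):
    """Coefficient rho of Eq. (26), by simple quadrature on a uniform grid of f."""
    e1 = np.array([moments(f, c)[0] for f in fs])
    e2 = np.array([moments(f, c)[1] for f in fs])
    w = phi / phi.sum()                                  # discrete weights for the ability density
    sigma2 = ((e2 - e1**2) * w).sum()                    # error variance, Eqs. (22) and (23)
    R = (e1 - e1[0]) - fs * (e1[-1] - e1[0])             # expected error, Eq. (24)
    sigma_e2 = (R**2 * w).sum() - (R * w).sum() ** 2     # Eq. (25)
    return 1.0 - sigma_e2 / (sigma2 + sigma_e2)          # Eq. (26)

if __name__ == "__main__":
    c = 4
    fs = np.linspace(0.0, 1.0, 2001)
    densities = {
        "Distribution I": np.ones_like(fs),                          # uniform on [0, 1]
        "Distribution II": np.exp(-0.5 * ((fs - 0.6) / 0.1) ** 2),   # normal(0.6, 0.1), truncated
    }
    for name, phi in densities.items():
        print(f"{name}: rho_MZ = {validity(mz_moments, c, phi, fs):.4f}")
```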

Table 2 Computed validity coefficients for \(c = 4\) rounded to four decimal places for two different distributions of the ability: one broad and one narrow. See text for details

4.2 Reliability

The total variance of the observed score is the sum of the variance in Eq. (23), representing the error (chance) contribution aggregated over test-takers of all abilities, and the two terms,

$$\begin{aligned} \int _0^1 E[S(f, c)]^2 \phi (f) \textrm{d}f - \left[ \int _0^1 E[S(f, c)] \phi (f) \textrm{d}f \right] ^2 \end{aligned}$$

representing the “true score” variance, the “true score”, E[S(f, c)], being simply the expectation value of the observed score [24]. Hence, the reliability coefficient, given as the ratio of true score variance to total variance, is

$$\begin{aligned} r = \frac{\int _0^1 E[S(f, c)]^2 \phi (f) \textrm{d}f - \left[ \int _0^1 E[S(f,c)] \phi (f) \textrm{d}f \right] ^2}{\int _0^1 V[S(f,c)]\phi (f) \textrm{d}f + \int _0^1 E[S(f, c)]^2 \phi (f) \textrm{d}f - \left[ \int _0^1 E[S(f,c)] \phi (f) \textrm{d}f \right] ^2} \end{aligned}$$
(27)

Note that this reliability coefficient is computed as an average over parallel forms of the test with non-identical items; it is not a test-retest coefficient. Thus a random element is present in whether or not the test-taker knows the keys and distractors on the parallel form, but even if non-identical, the items are still equivalent from the perspective of the model.
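The reliability coefficient of Eq. (27) can be evaluated with the same quadrature machinery; the sketch below does so for the dichotomous NC and NG rules under Distribution I (with the hypothetical normalization \(p_1 = 1\)), and the partial credits rules follow by substituting their moment expressions, e.g., Eqs. (14) and (15).

```python
import numpy as np
from math import comb

def P(i, f, c):
    if i == 1:
        return f + f**(c - 1) * (1.0 - f)
    return comb(c - 1, i - 1) * f**(c - i) * (1.0 - f)**i

def moments_nc(f, c):
    """E[S], E[S^2] for the NC rule with p_1 = 1, Eqs. (12) and (13)."""
    m = sum(P(i, f, c) / i for i in range(1, c + 1))
    return m, m

def moments_ng(f, c):
    """E[S], E[S^2] for the NG rule with p_1 = 1, Eqs. (20) and (21)."""
    return P(1, f, c), P(1, f, c)

def reliability(moments, c, phi, fs):
    """Eq. (27): ratio of true-score variance to total variance, by quadrature."""
    e1 = np.array([moments(f, c)[0] for f in fs])
    e2 = np.array([moments(f, c)[1] for f in fs])
    w = phi / phi.sum()
    error_var = ((e2 - e1**2) * w).sum()                 # Eq. (23)
    true_var = (e1**2 * w).sum() - (e1 * w).sum() ** 2   # variance of the true score E[S(f, c)]
    return true_var / (true_var + error_var)

if __name__ == "__main__":
    c = 4
    fs = np.linspace(0.0, 1.0, 2001)
    phi = np.ones_like(fs)                               # Distribution I; swap in a narrow density for II
    for name, m in (("NC", moments_nc), ("NG", moments_ng)):
        print(f"r_{name} = {reliability(m, c, phi, fs):.4f}")
```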

Under the same conditions as the computed validity coefficients, the calculated results are reported in Table 3. All of the partial credits scoring functions exhibit increased reliability with respect to both NC and NG scoring. An increased reliability for NG vis-à-vis NC scoring is also apparent, but it does not quite reach the level of the partial credits models. The modified SS scoring function (with penalties added for guessing) exhibits the highest predicted reliability. It is to be stressed that without this important modification, its reliability would be lower, approaching that of NC.

Table 3 Computed reliability coefficients for a test item with \(c = 4\) rounded to four decimal places for two different distributions of the ability: one broad and one narrow. See text for details

The decrease in computed reliability when moving from the broad to the narrow ability distribution is a consequence of the diminished true score variance, leaving more of the variance to chance effects. For a perfectly homogeneous distribution where all the test-takers share the same ability, there is no true score variance at all and the reliability coefficient is zero. The highest reliabilities will be obtained for distributions that are weighted toward the upper level of f-ability, because then the chance effects are reduced.

4.3 Discriminatory power

Polytomously scored items allow a finer discrimination among the observed scores compared to dichotomous ones. In this section, we seek to quantify this added value. In the literature, the “discrimination” of a test item is usually taken to be its Pearson correlation with the total test score. This is not sufficient for our analysis. We want to quantify the probability that a single test-taker receives, on a single item, a fair score reflecting her latent ability as precisely as possible. To be more precise: for two abilities \(f_1\), \(f_2\) with \(f_2 = f_1 + \epsilon \), \(\epsilon \ge 0\), a general scoring function S(f, c) will satisfy \(M[S(f_2, c)] = M[S(f_1, c)]\) for \(\epsilon \) sufficiently close to zero, where \(M(\cdot )\) designates the mode. These cases lead to a loss of discrimination among test-takers, and higher values of d (defined below) indicate less overall influence (in an average sense) of these errors on the score.

Let \(\pi (s; f, c)\) be the probability that a test-taker of ability f achieves at least a score s on an item with c options. Clearly, we have

$$\begin{aligned} \pi _{\textrm{NG}}(s; f, c) = P_1(f, c) \end{aligned}$$
(28)

and

$$\begin{aligned} \pi _{\textrm{NC}}(s; f, c) = P(f, c) \end{aligned}$$
(29)

independently of s for \(s > 0\) for any scoring function that does not award partial marks (the probability is unity for \(s = 0\)). With the MZ scoring function and no guessing, we have

$$\begin{aligned} \pi _\textrm{MZ}(s; f, c) = \left\{ \begin{array}{l r} \sum _{i=1}^c P_i(f, c), &{} s \in [p_c, p_{c-1}) \\ \sum _{i=1}^{c-1} P_{i}(f, c), &{} s \in [p_{c-1}, p_{c-2}) \\ \vdots &{} \vdots \\ P_1(f, c), &{} s \ge p_1 \end{array} \right. \end{aligned}$$
(30)

and the cases for FS and SS follow analogously.

Denote the point value of a blank answer by \(p_\textrm{blank}\). Then we use the integral

$$\begin{aligned} d = \int _0^1 {\textrm{d}}f \frac{ \pi (f \delta p + p_\textrm{blank}; f, c) - \pi (f \delta p + p_\textrm{blank} + \Delta ; f, c)}{\Delta } \phi (f) \end{aligned}$$
(31)

as a measure of the discriminatory power of the scoring function. Here \(\delta p\) is the point difference between a fully correct and a blank answer (usually this is \(p_1 - p_\textrm{blank}\) in the terminology of this paper, but for the FS rule, it is \(p_1 \ln c\)) and \(\Delta \) is a sensitivity parameter which we arbitrarily set to \(\Delta = \delta p / 10\) for the results reported; d diverges as \(\Delta \rightarrow 0\), rendering any comparison impossible in this limit, but smaller values of \(\Delta \) lead to larger differences in d between the scoring functions. Results are given in Table 4.
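A sketch of the computation of d for the MZ rule is given below (again with the hypothetical choice \(p_c = 0.30\) and a uniform ability density); the step function \(\pi \) is assembled directly from Eq. (30) and inserted into Eq. (31) with \(\Delta = \delta p / 10\).

```python
import numpy as np
from math import comb

def P(i, f, c):
    if i == 1:
        return f + f**(c - 1) * (1.0 - f)
    return comb(c - 1, i - 1) * f**(c - i) * (1.0 - f)**i

def pi_mz(s, f, c, p):
    """Eq. (30): probability of scoring at least s under the MZ rule (no guessing)."""
    return sum(P(i, f, c) for i in range(1, c + 1) if p[i - 1] >= s)

def discrimination_mz(c, phi, fs, p_c=0.30):
    """Discrimination index d of Eq. (31) for the MZ rule, with Delta = delta_p / 10."""
    p = [1.0 / i + (p_c - 1.0 / c) for i in range(1, c + 1)]   # MZ scores from Eq. (11)
    p_blank, delta_p = p[-1], p[0] - p[-1]
    Delta = delta_p / 10.0
    w = phi / phi.sum()                                        # discrete weights on the f grid
    vals = [(pi_mz(f * delta_p + p_blank, f, c, p)
             - pi_mz(f * delta_p + p_blank + Delta, f, c, p)) / Delta
            for f in fs]
    return float(np.dot(vals, w))

if __name__ == "__main__":
    c = 4
    fs = np.linspace(0.0, 1.0, 2001)
    print("d_MZ (uniform ability density):", round(discrimination_mz(c, np.ones_like(fs), fs), 2))
```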

Table 4 Discrimination index, d, rounded to two decimals for \(\Delta = \delta p / 10\) and \(c = 4\) for two different ability distributions: one broad and one narrow. See text for details

As seen, the results are very sensitive to the underlying distribution, with a complete reversal of the ranking of the partial credits models between Distributions I and II. The small value for MZ for the narrow distribution centered on \(f = 0.6\) reflects its poor discriminatory power around \(f \approx 0.6\). In fact, this scoring function is most discriminatory for \(f < 0.5\) and becomes essentially dichotomous above that.

4.4 Relative precision

The precision of the score reflects the information gained, in the sense of how certain one can be that the obtained score is correct [31], and this statistic is distinct from the reliability, which measures the extent to which the obtained scores for the same test-taker in independent measurements are correlated. We compute the relative precision by normalizing the standard deviation at fixed ability by the expectation of the score, thus

$$\begin{aligned} \text {relative uncertainty} = \frac{\sqrt{V[S(f, c)]}}{E[S(f, c)] - p_\textrm{blank}(c)} \end{aligned}$$

where V[S(f, c)] is given in Eq. (22) and \(p_\textrm{blank}(c)\) represents the score of a blank answer. This procedure yields a number that is more aptly termed “relative uncertainty” than “relative precision”, since larger values correspond to less precision. Averaged results over Distributions I and II are given in Table 5.
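As a pointwise illustration of this measure, the sketch below evaluates the relative uncertainty at a fixed ability \(f = 0.6\) for the deterministic partial credits rules (with the hypothetical choices \(p_1 = 1\) and \(p_c = 0.30\)); the entries of Table 5 correspond to the analogous quantities averaged over the ability densities.

```python
from math import comb, log

def P(i, f, c):
    if i == 1:
        return f + f**(c - 1) * (1.0 - f)
    return comb(c - 1, i - 1) * f**(c - i) * (1.0 - f)**i

def relative_uncertainty(f, c, credit, p_blank):
    """sqrt(V[S(f, c)]) / (E[S(f, c)] - p_blank) for a deterministic partial-credit rule."""
    e1 = sum(credit(i) * P(i, f, c) for i in range(1, c + 1))
    e2 = sum(credit(i) ** 2 * P(i, f, c) for i in range(1, c + 1))
    return (e2 - e1 ** 2) ** 0.5 / (e1 - p_blank)

if __name__ == "__main__":
    f, c, p_c = 0.6, 4, 0.30
    rules = {                                   # (credit per i remaining options, blank-answer score)
        "MZ": (lambda i: 1.0 / i + (p_c - 1.0 / c), p_c),
        "FS": (lambda i: log(c / i), 0.0),
        "SS": (lambda i: 1.0 - (i - 1) / (c - 1), 0.0),
    }
    for name, (credit, p_blank) in rules.items():
        print(f"{name}: relative uncertainty at f = {f}: "
              f"{relative_uncertainty(f, c, credit, p_blank):.2f}")
```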

Table 5 Population-averaged relative uncertainties of the observed scores on a test item with \(c=4\) for two different ability distributions. See text for details

As opposed to the case of the reliability, the precision is increased for the narrower distribution with respect to the broad uniform distribution. This is mostly a consequence of the reduced chance variation for test-takers of higher ability. Indeed, for the test-taker of ability \(f=1\), the computed precision is perfect. As inferred from the results, in the lower range of ability, the NG scoring function yields the most precise relative measurements, but this advantage is lost with more knowledgeable test-takers. Across both distributions, the modified SS rule is arguably the most precise. Note also that the large relative uncertainties (of the order of 30-50%) reported here are for a single test item. Unlike the reliability above, the relative uncertainty decays asymptotically toward zero, as an inverse square root of the number of items, within the model.

5 Conclusion

Having considered a model where the test-taker is presumed to possess a definite set of “facts” that are reflected in the distractors and keys of the multiple-choice test, we have compared different scoring functions for the kind of multiple-choice test items for which partial knowledge is expected to contribute to the test-taking strategy. We should also point out that the formal model that we have presented, and which is the basis for our analysis, relies on an axiomatic representation of factual knowledge and the probability of solving a multiple-choice test item. The approach is hence fundamentally different from polytomous Rasch [29, 2] and item-response models [43], which rely on statistical fitting, but similar to knowledge space theory [17, 16] in its epistemological assumptions, of which it can be considered a limiting special case. We stress that the partial credits scoring functions investigated are intended to suppress guessing a priori; they hence differ fundamentally from any approach [38,39,40,35] which attempts to account for guessing a posteriori. These a posteriori methods are inherently unreliable whenever sample sizes are small; they are hence unsuitable for general classroom assessments and limited to large-scale testing.

Like Frary [21] and Chica and Tárrago [12] before us, we find that compared to the NC scoring function which does not suppress guessing, the computed reliabilities are vastly improved for all scoring functions that penalize incorrect answers; moreover, the partial credits scoring functions all exhibit further improvements in reliability over the dichotomously scored item. The argument for using at least some form of partial credits scoring functions is thus strong. Moreover, taking irrationality and risk-seeking behavior into account, Espinosa and Gardeazabal [19] find that the penalty for incorrect answers should exceed the typical one which is usually taken to precisely negate the effect of blind guessing for dichotomous items. It is likely that the same applies also to partial credits scoring, but all of the scoring functions considered in this work are readily adaptable in this regard: for instance, for the MZ scoring function, this means that \(p_c\) should be chosen sufficiently large.

While we have refrained from taking into account irrationality, risk-taking and unawareness of the extent of one’s own knowledge as factors in test-taker behavior, we can consider the results as a mathematical limiting case. Real-world testing situations will presumably approach this limit if the test-takers are clearly instructed that “guessing” in the sense of identifying more than one possible answer leads to an expected increase of their score, relative to guessing on a single answer. It is not altogether unlikely that this removes the “ethical dilemma” that apparently keeps some test-takers from guessing even when explicitly encouraged to do so [38]. Moreover, students have been found highly compliant with instructions not to guess even in the absence of any penalty scoring [14]. If both the score-optimal strategy and the instructions align, it is difficult to imagine anything but stronger compliance, especially if students are encouraged to be honest about their partial knowledge.

With regard to more facets than just the reliability of the test, we find that there is no single clearly superior scoring function among the ones tested, each having its own merits: the NG scoring function produces scores with the highest linear correlation with the underlying construct; the modified SS rule yields the highest reliability and precision (its unmodified form does not suppress guessing); the FS rule yields the most consistent item-level discrimination across the ability range, and so on. This explains the proliferation of different scoring functions in the literature. The virtues of the Zapechelnyuk scoring function [44], as regards the reliability, have been noted by Otoyo and Bush [32] in an empirical study comparing it to traditional NC scoring. Here, we replicate this finding theoretically, but also report that other partial credits scoring functions should yield even higher reliabilities, a finding which should be confirmed empirically. We also note that the Zapechelnyuk scoring function seems more appropriate for scoring low-ability populations, on account of its low discriminatory power at the higher end of the f-distribution.