The Dependence of Chance-Corrected Weighted Agreement Coefficients on the Power Parameter of the Weighting Scheme: Analysis and Measurement

We consider the dependence of a broad class of chance-corrected weighted agreement coefficients on the weighting scheme that penalizes rater disagreements. The considered class encompasses many existing coefficients with any number of raters, and one real-valued power parameter defines the weighting scheme that includes linear, quadratic, identity, and radical weights. We obtain the first-order and second-order derivatives of the coefficients with respect to the power parameter and decompose them into components corresponding to all pairs of different category distances. Each component compares its two distances in terms of the ratio of observed to expected-by-chance frequency. A larger ratio for the smaller distance than the larger distance contributes to a positive relationship between the power parameter and the coefficient value; the opposite contributes to a negative relationship. We provide necessary and sufficient conditions for the coefficient value to increase or decrease and the relationship to intensify or weaken as the power parameter increases. We use the first-order and second-order derivatives for corresponding measurement. Furthermore, we show how these two derivatives allow other researchers to obtain quite accurate estimates of the coefficient value for unreported values of the power parameter, even without access to the original data. Supplementary Information The online version contains supplementary material available at 10.1007/s11336-022-09881-7.


Introduction
Agreement coefficients measure the extent to which raters agree when subjectively classifying items into mutually exclusive and exhaustive categories. Examples include the classification of communications based on content, images based on visible aspects, and diagnoses of patients. High rater agreement indicates that the obtained categorical data are reproducible. In contrast, low rater agreement means that the raters interpreted the items or categories differently, jeopardizing the validity of subsequent analyses.
Due to limited choice options, raters may guess the category without knowing the actual category, implying that some rater agreements occur by chance. Because agreements by chance do not provide intrinsic value, agreement coefficients usually aim to exclude them (Banerjee et al. 1999; Janson and Olsson 2001). Different ways to correct for chance agreement have resulted in various agreement coefficients.
In addition to nominal (unordered) categories, many settings involve classification into ordinal (ordered) categories, such as 5-point rating scales. Ordinal categories require the researcher to choose both a suitable agreement coefficient and a weighting scheme that assigns partial credit to rater disagreements. The amount of credit (or penalization) for disagreements typically depends on the distance between the chosen categories, but many options exist to capture this dependence.
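We define observed weighted agreement as

$$A_w = \frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}\sum_{\tilde{c}=1}^{C} w_{c,\tilde{c}}\,\frac{R_{i,c}\left(R_{i,\tilde{c}} - 1\{c=\tilde{c}\}\right)}{R(R-1)}, \tag{1}$$

where $N$ is the number of items, $C \geq 3$ is the number of categories, $R \geq 2$ is the number of raters, and $R_{i,c}$ is the number of raters who assign item $i$ to category $c$, with $\sum_{c=1}^{C} R_{i,c} = R$. The indicator $1\{c=\tilde{c}\}$ equals one if $c = \tilde{c}$ and zero otherwise. Furthermore, $w_{c,\tilde{c}}$ defines the weights for pairwise rater (dis)agreements, where $c$ is the category chosen by the first rater, and $\tilde{c}$ is chosen by the second rater: $w_{c,\tilde{c}} = 1$ if $c = \tilde{c}$, and $0 \leq w_{c,\tilde{c}} < 1$ if $c \neq \tilde{c}$ (i.e., full credit if the two raters agree, and partial or no credit if they disagree). Because of symmetric weights, $w_{c,\tilde{c}} = w_{\tilde{c},c}$ for all $c \neq \tilde{c}$. Expression (1) is consistent with Gwet (2014) and Van Oest and Girard (2021); it reduces to Fleiss (1971) if $w_{c,\tilde{c}} = 0$ for all $c \neq \tilde{c}$.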
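We consider weighting schemes that penalize rater disagreements based on the distance between the chosen categories, with power parameter $\gamma$ (Vanbelle 2016; Warrens 2013, 2014):

$$w_{c,\tilde{c}} = 1 - \left(\frac{|c-\tilde{c}|}{C-1}\right)^{\gamma}. \tag{2}$$

These weights become linear if $\gamma = 1$, become quadratic if $\gamma = 2$, become radical if $\gamma = .5$, and converge to identity weights (i.e., $w_{c,\tilde{c}} = 0$ for all $c \neq \tilde{c}$, implying unweighted agreement) if $\gamma \to 0$. The weighting scheme does not award credit to rater disagreements as $\gamma$ approaches zero, whereas it becomes increasingly lenient as $\gamma$ increases. A smaller power parameter (e.g., radical weights) is appropriate for situations where even minor rater disagreements are serious. For example, different teachers may need to do grading and independent regrading of student exams, where one-step deviations are part of the game, but larger deviations quickly become unacceptable. Conversely, a larger power parameter (e.g., quadratic weights) is appropriate if only major rater disagreements are problematic. Furthermore, linear weights are suitable if no obvious arguments exist to deviate from penalization in proportion to the distance of disagreement.

For the sake of illustration, we write out the weights defined by (2) for $C = 5$ categories and different values of the power parameter. Because the weights depend only on the distance $|c-\tilde{c}|$, the matrix $W = (w_{c,\tilde{c}})$ is fully described by its values at distances 0, 1, 2, 3, and 4: (1, 0, 0, 0, 0) for identity weights ($\gamma \to 0$); (1, .5000, .2929, .1340, 0) for radical weights ($\gamma = .5$); (1, .7500, .5000, .2500, 0) for linear weights ($\gamma = 1$); and (1, .9375, .7500, .4375, 0) for quadratic weights ($\gamma = 2$).

By substituting the weights (2) into observed weighted agreement (1) and subsequently working out the brackets and using that $(|c-\tilde{c}|/(C-1))^{\gamma} = 0$ if $c = \tilde{c}$, we obtain

$$A_w = 1 - \frac{\sum_{i=1}^{N}\sum_{c=1}^{C}\sum_{\tilde{c}\neq c}\left(\frac{|c-\tilde{c}|}{C-1}\right)^{\gamma} R_{i,c}R_{i,\tilde{c}}}{N R (R-1)}. \tag{3}$$

Symmetry in the numerator of (3) regarding $c$ and $\tilde{c}$ (and thus $\tilde{c} < c$ and $\tilde{c} > c$ in $\tilde{c} \neq c$) implies

$$A_w = 1 - \sum_{c=2}^{C}\sum_{\tilde{c}=1}^{c-1}\left(\frac{c-\tilde{c}}{C-1}\right)^{\gamma} O_{c,\tilde{c}}, \tag{4}$$

where

$$O_{c,\tilde{c}} = \frac{2\sum_{i=1}^{N} R_{i,c}R_{i,\tilde{c}}}{N R (R-1)} \tag{5}$$

is the observed fraction of cases (i.e., combinations of items and rater pairs) for which one rater chooses $c$ and the other rater chooses $\tilde{c} < c$. In words, (4) states that observed weighted agreement $A_w$ equals one minus the total observed weighted disagreement.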
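The power-weighting scheme in (2) can be sketched in a few lines of Python. This is a minimal illustration only: the function name `power_weights` is hypothetical and is not taken from the Ox supplementary code, and categories are indexed from zero for convenience.

```python
def power_weights(num_categories, gamma):
    """Return the C x C matrix with entries 1 - (|c - c~| / (C - 1)) ** gamma."""
    span = num_categories - 1
    return [
        [1.0 - (abs(c - c_other) / span) ** gamma for c_other in range(num_categories)]
        for c in range(num_categories)
    ]


for gamma in (0.5, 1.0, 2.0):
    # First row of the matrix: weights for distances 0, 1, 2, 3, 4.
    print(gamma, [round(w, 4) for w in power_weights(5, gamma)[0]])
```

Because the weights depend only on the category distance, printing the first row is enough to characterize the whole matrix for a given power parameter.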

Chance-Corrected Weighted Agreement
We consider a broad class of chance-corrected weighted agreement coefficients:

$$I_w = \frac{A_w - \sum_{c=1}^{C}\sum_{\tilde{c}=1}^{C} w_{c,\tilde{c}}\,p_c q_{\tilde{c}}}{1 - \sum_{c=1}^{C}\sum_{\tilde{c}=1}^{C} w_{c,\tilde{c}}\,p_c q_{\tilde{c}}}, \tag{6}$$

where the category proportions in the chance correction sum to one and are greater than zero:

$$\sum_{c=1}^{C} p_c = \sum_{c=1}^{C} q_c = 1, \qquad p_c > 0, \qquad q_c > 0, \qquad c = 1,\ldots,C. \tag{7}$$

The first part of (7) is logical consistency; the two other parts hold if, for example, all categories are chosen at least once by one of the raters if $p_c = q_c$ for all $c$ or chosen by any two raters (not necessarily for the same item) if $p_c \neq q_c$. The class of coefficients, defined by (6) and (7), is general. For $R = 2$ raters, it includes weighted versions of Cohen's kappa (Cohen 1960, 1968) and Scott's pi (Scott 1955). For $R \geq 2$ raters, it includes weighted versions of Fleiss' kappa (Fleiss 1971), the uniform prior coefficient (Van Oest 2019; Van Oest and Girard 2021), and the S-coefficient (Bennett et al. 1954; Brennan and Prediger 1981; Warrens 2014). Table 1 provides the operationalizations of $p_c$ and $q_c$ for these coefficients.

Although the class of coefficients does not include Krippendorff's alpha (Gwet 2014, p. 88), this coefficient converges to the weighted Fleiss' kappa as the number of items $N$ increases. Thus, these coefficients usually provide similar values (Gwet 2014). Another excluded coefficient is the weighted kappa for $R \geq 3$ raters (Mielke et al. 2007). This coefficient considers the $R$-dimensional category combinations from all raters together (instead of rater pairs) but is equivalent to a weighted version of Conger's kappa, where expected weighted agreement $\sum_{c=1}^{C}\sum_{\tilde{c}=1}^{C} w_{c,\tilde{c}}\,p_c q_{\tilde{c}}$ becomes the corresponding average across all rater pairs, with rater-specific category proportions $p_c$ and $q_{\tilde{c}}$ (Conger 1980; Warrens 2012b). Furthermore, the class of coefficients excludes Gwet's AC2 (Gwet 2014, p. 89), which replaces $\sum_{c=1}^{C}\sum_{\tilde{c}=1}^{C} w_{c,\tilde{c}}\,p_c q_{\tilde{c}}$ by a substantially different expression.
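We substitute the weights (2) and observed weighted agreement (4) into coefficient (6):

$$I_w = \frac{A_w - \sum_{c=1}^{C}\sum_{\tilde{c}=1}^{C} w_{c,\tilde{c}}\,p_c q_{\tilde{c}}}{1 - \sum_{c=1}^{C}\sum_{\tilde{c}=1}^{C} w_{c,\tilde{c}}\,p_c q_{\tilde{c}}} = \frac{\sum_{c=1}^{C}\sum_{\tilde{c}=1}^{C}\left(\frac{|c-\tilde{c}|}{C-1}\right)^{\gamma} p_c q_{\tilde{c}} - \sum_{c=2}^{C}\sum_{\tilde{c}=1}^{c-1}\left(\frac{c-\tilde{c}}{C-1}\right)^{\gamma} O_{c,\tilde{c}}}{\sum_{c=1}^{C}\sum_{\tilde{c}=1}^{C}\left(\frac{|c-\tilde{c}|}{C-1}\right)^{\gamma} p_c q_{\tilde{c}}} = 1 - \frac{\sum_{c=2}^{C}\sum_{\tilde{c}=1}^{c-1}(c-\tilde{c})^{\gamma}\,O_{c,\tilde{c}}}{\sum_{c=2}^{C}\sum_{\tilde{c}=1}^{c-1}(c-\tilde{c})^{\gamma}\,E_{c,\tilde{c}}}, \tag{8}$$

where $\sum_{c=1}^{C}\sum_{\tilde{c}=1}^{C} p_c q_{\tilde{c}} = 1$ in the middle step of (8) due to property (7), and

$$E_{c,\tilde{c}} = p_c q_{\tilde{c}} + p_{\tilde{c}}\, q_c \tag{9}$$

is the fraction of cases (i.e., combinations of items and rater pairs) expected by chance for which one rater chooses $c$ and the other rater chooses $\tilde{c} < c$, with $E_{c,\tilde{c}} > 0$ because of (7). Coefficient $I_w$ in (8) equals one minus the ratio of the total observed and expected weighted disagreements.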
To facilitate further interpretation, we recall the definition of $O_{c,\tilde{c}}$ in (5) and define

$$O_{(l)} = \sum_{c=l+1}^{C} O_{c,c-l} \tag{10}$$

as the observed fraction of cases for which the categories $c$ and $\tilde{c} < c$, chosen by two raters, are $l \in \{1,\ldots,C-1\}$ steps apart; we put brackets around subscript $l$ to emphasize that it refers to the distance between categories. Analogously, recalling the definition of $E_{c,\tilde{c}}$ in (9), we define

$$E_{(l)} = \sum_{c=l+1}^{C} E_{c,c-l} \tag{11}$$

as the fraction of cases expected by chance for which the categories $c$ and $\tilde{c} < c$, obtained from two raters, are $l$ steps apart. Table 1 provides the exact expressions of (11) for the coefficients. We note that $E_{(l)} > 0$ because of (7). Using (10) and (11), we rewrite the chance-corrected weighted agreement coefficient (8) in terms of all possible category distances and their observed and expected frequencies:

$$I_w = 1 - \frac{\sum_{l=1}^{C-1} l^{\gamma}\,O_{(l)}}{\sum_{l=1}^{C-1} l^{\gamma}\,E_{(l)}}. \tag{12}$$
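The distance-based form (12) translates directly into code. The sketch below is an illustration under stated assumptions rather than the paper's Ox implementation: it assumes Fleiss-type chance proportions ($p_c = q_c$ equal to the overall share of category $c$), the function names are hypothetical, and the toy rater counts are invented for demonstration.

```python
def distance_fractions(counts):
    """Return (observed, chance); entry l - 1 holds O_(l) or E_(l) for l = 1..C-1."""
    n_items = len(counts)
    n_cats = len(counts[0])
    n_raters = sum(counts[0])
    ordered_pairs = n_items * n_raters * (n_raters - 1)
    # Assumption: Fleiss-type chance proportions, p_c = q_c = overall category share.
    p = [sum(row[c] for row in counts) / (n_items * n_raters) for c in range(n_cats)]
    observed = [0.0] * (n_cats - 1)
    chance = [0.0] * (n_cats - 1)
    for c in range(1, n_cats):
        for c_low in range(c):
            dist = c - c_low
            # O_{c,c~}: share of rater pairs choosing c and c~ on the same item.
            observed[dist - 1] += 2.0 * sum(row[c] * row[c_low] for row in counts) / ordered_pairs
            # E_{c,c~} = p_c q_c~ + p_c~ q_c.
            chance[dist - 1] += p[c] * p[c_low] + p[c_low] * p[c]
    return observed, chance


def coefficient(observed, chance, gamma):
    """One minus the ratio of observed to expected weighted disagreement."""
    num = sum((i + 1) ** gamma * o for i, o in enumerate(observed))
    den = sum((i + 1) ** gamma * e for i, e in enumerate(chance))
    return 1.0 - num / den


# Hypothetical toy data: 4 items (rows), 3 raters, 3 categories (columns).
toy_counts = [[3, 0, 0], [0, 3, 0], [1, 1, 1], [0, 1, 2]]
obs_l, chance_l = distance_fractions(toy_counts)
for gamma in (0.5, 1.0, 2.0):
    print(gamma, round(coefficient(obs_l, chance_l, gamma), 4))
```

As the power parameter approaches zero, every distance receives the same penalty, so the value returned by `coefficient` approaches the unweighted Fleiss' kappa of the same data.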

First-Order Derivative
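The first-order derivative of $I_w$ with respect to $\gamma$ describes coefficient susceptibility, that is, the direction and degree to which the coefficient value changes as the power parameter of the weighting scheme increases.

Theorem 1. The first-order derivative of coefficient $I_w$ in (12) with respect to power parameter $\gamma$ in weighting scheme (2) is

$$\frac{dI_w}{d\gamma} = \sum_{m=2}^{C-1}\sum_{l=1}^{m-1} \ln\!\left(\frac{m}{l}\right) \frac{l^{\gamma}E_{(l)}}{\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}}\, \frac{m^{\gamma}E_{(m)}}{\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}} \left(\frac{O_{(l)}}{E_{(l)}} - \frac{O_{(m)}}{E_{(m)}}\right). \tag{13}$$

Note to Table 1: the rater-specific indicator for rater 1 equals 1 if rater 1 assigns item $i$ to category $c$ (zero else), and the rater-specific indicator for rater 2 equals 1 if rater 2 assigns item $i$ to category $c$ (zero else). For later reference: Theorems 1 and 2, and Corollaries 1, 2, 3, 7, and 8 apply to all coefficients, with Corollaries 2 and 8 pertaining to $C = 3$ categories.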
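Proof. Differentiating (12) with respect to $\gamma$ yields

$$\frac{dI_w}{d\gamma} = -\frac{\sum_{l=1}^{C-1}\ln(l)\,l^{\gamma}O_{(l)}\sum_{m=1}^{C-1}m^{\gamma}E_{(m)} - \sum_{l=1}^{C-1}l^{\gamma}O_{(l)}\sum_{m=1}^{C-1}\ln(m)\,m^{\gamma}E_{(m)}}{\left(\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}\right)^{2}} = \frac{\sum_{l=1}^{C-1}\sum_{m=1}^{C-1}\left(-\ln(l)+\ln(m)\right) l^{\gamma}m^{\gamma}\,O_{(l)}E_{(m)}}{\left(\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}\right)^{2}},$$

where the notation uses different indices $l$, $m$, and $s$ in the summations to allow for combining. Using that $-\ln(l) + \ln(m) = \ln(m/l)$ and that $\ln(m/l) = 0$ if $l = m$, we obtain

$$\frac{dI_w}{d\gamma} = \frac{\sum_{m=1}^{C-1}\sum_{l \neq m}\ln(m/l)\, l^{\gamma}m^{\gamma}\,O_{(l)}E_{(m)}}{\left(\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}\right)^{2}}.$$

Next, we decompose $l \neq m$ into $l < m$ and $l > m$. Interchanging the labels $l$ and $m$ in the part with $l > m$ and using $\ln(l/m) = -\ln(m/l)$ gives

$$\frac{dI_w}{d\gamma} = \frac{\sum_{m=1}^{C-1}\sum_{l<m}\ln(m/l)\, l^{\gamma}m^{\gamma}\left(O_{(l)}E_{(m)} - O_{(m)}E_{(l)}\right)}{\left(\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}\right)^{2}}.$$

Using that $m = 1$ is infeasible if $l < m$, and rewriting $O_{(l)}E_{(m)} - O_{(m)}E_{(l)} = E_{(l)}E_{(m)}\left[(O_{(l)}/E_{(l)}) - (O_{(m)}/E_{(m)})\right]$, yields the result, completing the proof.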
The first-order derivative in (13) is a weighted sum taken over all pairs of different category distances $m$ and $l < m$. As reflected by the term $(O_{(l)}/E_{(l)}) - (O_{(m)}/E_{(m)})$, each component compares its smaller distance $l$ with its larger distance $m$ in terms of the ratio of observed to expected-by-chance frequency. Because all other terms in (13) are strictly positive, it holds that $dI_w/d\gamma > 0$ (i.e., coefficient $I_w$ is increasing in power parameter $\gamma$) if the ratio of observed to expected-by-chance frequency tends to decrease as categories are farther apart; that is, if mostly $O_{(l)}/E_{(l)} > O_{(m)}/E_{(m)}$ for $l < m$. However, violations are allowed because of the compensatory structure in the weighted sum.
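The pairwise decomposition of the first-order derivative can be checked numerically. The following sketch sums the components over all pairs $l < m$ and compares the result with a central finite difference of the coefficient; the function names and the fractions for $C = 4$ categories are hypothetical and serve only as an illustration.

```python
import math


def coefficient(obs, chance, gamma):
    """I_w = 1 - sum(l^gamma O_(l)) / sum(l^gamma E_(l)); entry l - 1 holds distance l."""
    num = sum(l ** gamma * obs[l - 1] for l in range(1, len(obs) + 1))
    den = sum(l ** gamma * chance[l - 1] for l in range(1, len(chance) + 1))
    return 1.0 - num / den


def first_derivative(obs, chance, gamma):
    """Sum the pairwise components of the first-order derivative over all l < m."""
    n = len(obs)
    total = sum(s ** gamma * chance[s - 1] for s in range(1, n + 1))
    result = 0.0
    for m in range(2, n + 1):
        for l in range(1, m):
            share_l = l ** gamma * chance[l - 1] / total
            share_m = m ** gamma * chance[m - 1] / total
            ratio_gap = obs[l - 1] / chance[l - 1] - obs[m - 1] / chance[m - 1]
            result += math.log(m / l) * share_l * share_m * ratio_gap
    return result


# Hypothetical fractions for C = 4 categories (distances 1, 2, 3).
obs = [0.20, 0.08, 0.02]
chance = [0.30, 0.25, 0.10]
step = 1e-6
numeric = (coefficient(obs, chance, 1.0 + step) - coefficient(obs, chance, 1.0 - step)) / (2 * step)
print(first_derivative(obs, chance, 1.0), numeric)
```

In this illustration the ratio of observed to expected-by-chance frequency decreases with the distance, so every component is positive and the derivative is positive as well.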
In (13), a component's comparison of category distances becomes more important in shaping the first-order derivative as (i) the ratio of the larger versus smaller distance increases (via $\ln(m/l)$), and (ii) these two distances capture greater shares of the total expected weighted disagreement across all distances (via the fractions $l^{\gamma}E_{(l)}/\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}$ and $m^{\gamma}E_{(m)}/\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}$). Furthermore, the latter implies that a component's importance increases as its two distances $l$ and $m$ are more likely to occur by chance (via $E_{(l)}$ and $E_{(m)}$), and these distances $l$ and $m$ increase, where higher values of $\gamma$ play a reinforcing role (via $l^{\gamma}$ and $m^{\gamma}$). For example, in a two-rater contingency table, the elements far from the main diagonal increasingly determine how the coefficient value responds to changes in the power parameter as this parameter increases. An implication is that the relationship between the power parameter and the coefficient value can be non-monotonic, as changes in $\gamma$ trigger shifts in the importance of components that compare different category distances, with possibly opposite contributions via the signs of $(O_{(l)}/E_{(l)}) - (O_{(m)}/E_{(m)})$. Furthermore, the log-ratio of category distances $\ln(m/l)$ in (13) implies that the degree of coefficient susceptibility is often higher in settings with more categories (i.e., higher $C$). The reason is that $\ln(m/l)$ tends to take higher values as $C$ increases, magnifying the effects of the comparisons $(O_{(l)}/E_{(l)}) - (O_{(m)}/E_{(m)})$.

562 PSYCHOMETRIKA

Conditions for Direction of Coefficient Susceptibility
We obtain a necessary and sufficient condition from (13) in Theorem 1:

Corollary 1a. As power parameter $\gamma$ in weighting scheme (2) increases, the chance-corrected weighted agreement coefficient $I_w$ in (12) increases if and only if

$$\sum_{m=2}^{C-1}\sum_{l=1}^{m-1}\ln\!\left(\frac{m}{l}\right) l^{\gamma}m^{\gamma}\,E_{(l)}E_{(m)}\left(\frac{O_{(l)}}{E_{(l)}} - \frac{O_{(m)}}{E_{(m)}}\right) > 0.$$

Corollary 1b. As power parameter $\gamma$ in weighting scheme (2) increases, the chance-corrected weighted agreement coefficient $I_w$ in (12) decreases if and only if

$$\sum_{m=2}^{C-1}\sum_{l=1}^{m-1}\ln\!\left(\frac{m}{l}\right) l^{\gamma}m^{\gamma}\,E_{(l)}E_{(m)}\left(\frac{O_{(l)}}{E_{(l)}} - \frac{O_{(m)}}{E_{(m)}}\right) < 0.$$

Proof. We have $\sum_{s=1}^{C-1}s^{\gamma}E_{(s)} > 0$ due to property (7), so we may ignore these $\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}$ terms that do not determine the sign of $dI_w/d\gamma$ in (13). Thus, $dI_w/d\gamma > 0$ is equivalent to the simpler condition in Corollary 1a, and $dI_w/d\gamma < 0$ is equivalent to the condition in Corollary 1b.

As before, each component compares its two category distances $m$ and $l < m$ in terms of the ratio of observed to expected-by-chance frequency, and components comparing larger distances become relatively more important as the power parameter increases (via $l^{\gamma}$ and $m^{\gamma}$).
For C = 3 categories, the necessary and sufficient condition in Corollary 1 becomes particularly simple:

Corollary 2a. As power parameter $\gamma$ in weighting scheme (2) increases in settings with three categories, the chance-corrected weighted agreement coefficient $I_w$ in (12) increases if and only if

$$\frac{O_{(1)}}{E_{(1)}} > \frac{O_{(2)}}{E_{(2)}}.$$

Corollary 2b. As power parameter $\gamma$ in weighting scheme (2) increases in settings with three categories, the chance-corrected weighted agreement coefficient $I_w$ in (12) decreases if and only if

$$\frac{O_{(1)}}{E_{(1)}} < \frac{O_{(2)}}{E_{(2)}}.$$

Proof. The only feasible pair of category distances with $l < m$ for $C = 3$ categories corresponds to $l = 1$ and $m = 2$. Because $\ln(m/l)\,l^{\gamma}m^{\gamma}E_{(l)}E_{(m)} > 0$, substituting $l = 1$ and $m = 2$ into Corollary 1 yields Corollary 2.

Corollary 2 implies that the relationship between $\gamma$ and $I_w$ is always monotonic (either increasing or decreasing) for $C = 3$ categories. The direction is determined by whether the ratio of observed to expected-by-chance frequency is greater for combinations of categories that are one step apart or two steps apart.
Furthermore, we obtain a sufficient condition that extends the sufficient condition by Warrens (2013) beyond two raters:

Corollary 3a. As power parameter $\gamma$ in weighting scheme (2) increases, the chance-corrected weighted agreement coefficient $I_w$ in (12) increases if the ratio $O_{(l)}/E_{(l)}$ is decreasing in the category distance $l$.

Corollary 3b. As power parameter $\gamma$ in weighting scheme (2) increases, the chance-corrected weighted agreement coefficient $I_w$ in (12) decreases if the ratio $O_{(l)}/E_{(l)}$ is increasing in the category distance $l$.

Proof. The necessary and sufficient condition for $dI_w/d\gamma > 0$ in Corollary 1a holds if $(O_{(l)}/E_{(l)}) - (O_{(m)}/E_{(m)}) > 0$ for all $l < m$, that is, if $O_{(l)}/E_{(l)}$ is decreasing in $l$. Similarly, the necessary and sufficient condition for $dI_w/d\gamma < 0$ in Corollary 1b holds if $O_{(l)}/E_{(l)}$ is increasing in $l$.

Thus, the relationship between power parameter $\gamma$ and coefficient $I_w$ is monotonic if the ratio of observed to expected-by-chance frequency is monotonic in the category distance. The sufficient condition in Corollary 3 is both necessary and sufficient in three-category settings due to Corollary 2.

Conditions for Direction of Coefficient Susceptibility: Weighted S-coefficient
It is instructive to apply the three corollaries to the weighted S-coefficient, which assumes that all $C$ categories are equally likely to occur by chance; that is, $p_c = q_c = 1/C$, $c = 1,\ldots,C$, and hence $E_{(l)} = 2(C-l)/C^{2}$ (see Table 1). We note that $C - l$ is the number of category combinations $(c,\tilde{c})$ with distance $l$; that is, satisfying $c - \tilde{c} = l$. By applying Corollary 1 to the weighted S-coefficient, we obtain a necessary and sufficient condition for this coefficient:

Corollary 4a. As power parameter $\gamma$ in weighting scheme (2) increases, the weighted S-coefficient increases if and only if

$$\sum_{m=2}^{C-1}\sum_{l=1}^{m-1}\ln\!\left(\frac{m}{l}\right) l^{\gamma}m^{\gamma}\,(C-l)(C-m)\left(\frac{O_{(l)}}{C-l} - \frac{O_{(m)}}{C-m}\right) > 0.$$

Corollary 4b. As power parameter $\gamma$ in weighting scheme (2) increases, the weighted S-coefficient decreases if and only if the same expression is smaller than zero.

Proof. Substituting $E_{(l)} = 2(C-l)/C^{2}$ into Corollary 1 and ignoring the positive constant term $2/C^{2}$ (that does not affect the sign) yields the result.

As before, this condition considers a weighted sum taken over all pairs of different category distances $m$ and $l < m$. As reflected by the term $O_{(l)}/(C-l) - O_{(m)}/(C-m)$, each component compares its smaller distance $l$ with its larger distance $m$ in terms of the average observed frequency per category combination $(c,\tilde{c})$. Component importance increases as more category combinations $(c,\tilde{c})$ have the corresponding distances $l$ and $m$ (via $C-l$ and $C-m$), the ratio of the larger versus smaller distance increases (via $\ln(m/l)$), and these distances themselves increase, where higher values of $\gamma$ play a reinforcing role (via $l^{\gamma}$ and $m^{\gamma}$).
Analogous to Corollary 2, the necessary and sufficient condition for the weighted S-coefficient in Corollary 4 becomes particularly simple if there are only C = 3 categories:

Corollary 5a. As power parameter $\gamma$ in weighting scheme (2) increases in settings with three categories, the weighted S-coefficient increases if and only if $O_{(1)}/2 > O_{(2)}$.

Corollary 5b. As power parameter $\gamma$ in weighting scheme (2) increases in settings with three categories, the weighted S-coefficient decreases if and only if $O_{(1)}/2 < O_{(2)}$.

Proof. The only feasible pair of category distances with $l < m$ for $C = 3$ categories corresponds to $l = 1$ and $m = 2$. Because $\ln(m/l)\,l^{\gamma}m^{\gamma}(C-l)(C-m) > 0$, substituting $l = 1$, $m = 2$, and $C = 3$ into Corollary 4 yields Corollary 5.

Corollary 5 implies that the relationship between the power parameter and the weighted S-coefficient is always monotonic (either increasing or decreasing) for $C = 3$ categories. The direction is determined by whether the average observed frequency of the two category combinations $(c = 2, \tilde{c} = 1)$ and $(c = 3, \tilde{c} = 2)$, with distance one, is greater than the observed frequency of category combination $(c = 3, \tilde{c} = 1)$, with distance two, or not.
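The closed form $E_{(l)} = 2(C-l)/C^{2}$ for the weighted S-coefficient can be verified by brute-force enumeration of category combinations. The sketch below does this and also encodes the three-category comparison of the average distance-one frequency with the distance-two frequency; both function names are hypothetical.

```python
def s_coefficient_expected(num_categories):
    """E_(l), l = 1..C-1, under uniform chance proportions p_c = q_c = 1 / C."""
    p = 1.0 / num_categories
    expected = []
    for dist in range(1, num_categories):
        # Count category combinations (c, c~) with c - c~ = dist.
        n_combos = sum(
            1 for c in range(num_categories) for c_low in range(c) if c - c_low == dist
        )
        # Each combination contributes p_c q_c~ + p_c~ q_c = 2 / C^2.
        expected.append(n_combos * 2.0 * p * p)
    return expected


def s_coefficient_increases_three_categories(obs_1, obs_2):
    """For C = 3: True if the average distance-one frequency exceeds the distance-two frequency."""
    return obs_1 / 2.0 > obs_2


print(s_coefficient_expected(5))
```

Enumerating the combinations instead of hard-coding $C - l$ keeps the check independent of the formula it is meant to confirm.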
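Furthermore, we obtain a sufficient condition for the weighted S-coefficient that extends a sufficient condition by Warrens (2014) beyond two raters:

Corollary 6a. As power parameter $\gamma$ in weighting scheme (2) increases, the weighted S-coefficient increases if the average observed frequency $O_{(l)}/(C-l)$ is decreasing in distance $l$.

Corollary 6b. As power parameter $\gamma$ in weighting scheme (2) increases, the weighted S-coefficient decreases if the average observed frequency $O_{(l)}/(C-l)$ is increasing in distance $l$.

Proof. The necessary and sufficient condition for an increasing weighted S-coefficient in Corollary 4a holds if $O_{(l)}/(C-l) - O_{(m)}/(C-m) > 0$ for all $l < m$, that is, if $O_{(l)}/(C-l)$ is decreasing in $l$. Similarly, the necessary and sufficient condition for a decreasing weighted S-coefficient in Corollary 4b holds if $O_{(l)}/(C-l)$ is increasing in $l$.

Thus, the relationship between the power parameter and the weighted S-coefficient is monotonic if the average observed frequency per category combination is monotonic in the category distance. The sufficient condition in Corollary 6 is both necessary and sufficient in three-category settings due to Corollary 5.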

Second-Order Derivative
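The second-order derivative of $I_w$ with respect to $\gamma$ helps describe change in coefficient susceptibility, that is, whether the coefficient's susceptibility to the power parameter of the weighting scheme intensifies or weakens as this parameter increases.

Theorem 2. The second-order derivative of coefficient $I_w$ in (12) with respect to power parameter $\gamma$ in weighting scheme (2) is

$$\frac{d^{2}I_w}{(d\gamma)^{2}} = \sum_{m=2}^{C-1}\sum_{l=1}^{m-1}\left\{\frac{\sum_{s=1}^{C-1}\ln(lm/s^{2})\,s^{\gamma}E_{(s)}}{\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}}\right\}\ln\!\left(\frac{m}{l}\right)\frac{l^{\gamma}E_{(l)}}{\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}}\,\frac{m^{\gamma}E_{(m)}}{\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}}\left(\frac{O_{(l)}}{E_{(l)}} - \frac{O_{(m)}}{E_{(m)}}\right). \tag{14}$$

Proof. Starting from the first-order derivative in (13), we first note that

$$\frac{d}{d\gamma}\left[\frac{l^{\gamma}m^{\gamma}}{\left(\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}\right)^{2}}\right] = \left\{\ln(lm) - \frac{2\sum_{s=1}^{C-1}\ln(s)\,s^{\gamma}E_{(s)}}{\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}}\right\}\frac{l^{\gamma}m^{\gamma}}{\left(\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}\right)^{2}}.$$

Using this result and (13), we obtain

$$\frac{d^{2}I_w}{(d\gamma)^{2}} = \sum_{m=2}^{C-1}\sum_{l=1}^{m-1}\left\{\ln(lm) - \frac{2\sum_{s=1}^{C-1}\ln(s)\,s^{\gamma}E_{(s)}}{\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}}\right\}\ln\!\left(\frac{m}{l}\right)\frac{l^{\gamma}E_{(l)}}{\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}}\,\frac{m^{\gamma}E_{(m)}}{\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}}\left(\frac{O_{(l)}}{E_{(l)}} - \frac{O_{(m)}}{E_{(m)}}\right).$$

Rewriting the term in the first set of accolades as $\sum_{s=1}^{C-1}\ln(lm/s^{2})\,s^{\gamma}E_{(s)} / \sum_{s=1}^{C-1}s^{\gamma}E_{(s)}$ yields the result, completing the proof.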
Like the first-order derivative in (13), the second-order derivative in (14) is a weighted sum taken over all pairs of different category distances $m$ and $l < m$. As reflected by the term $(O_{(l)}/E_{(l)}) - (O_{(m)}/E_{(m)})$, each component compares its smaller distance $l$ with its larger distance $m$ in terms of the ratio of observed to expected-by-chance frequency. The only difference between the two derivatives is the componentwise multiplier in the first set of accolades in (14). If this multiplier is positive, the component affects the first-order and second-order derivatives in the same direction: The component's comparison of category distances increasingly shapes the relationship between power parameter $\gamma$ and coefficient $I_w$ (i.e., susceptibility tends to intensify) as $\gamma$ increases. Conversely, a negative multiplier implies opposite effects on the two derivatives: The component's influence reduces (i.e., susceptibility tends to weaken) as $\gamma$ increases. Because the multiplier is increasing in the distances $l$ and $m$, components comparing larger category distances become relatively more influential than components comparing smaller category distances as the power parameter increases. Furthermore, settings with more categories (i.e., higher $C$) are more likely to have substantial multipliers, making large changes in coefficient susceptibility more likely. The reason is that the term $\ln(lm/s^{2})$ in the multiplier in (14) can take more extreme values as $C$ increases.
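The relation between the two derivatives, with each component of the first-order derivative scaled by a multiplier built from $\ln(lm/s^{2})$, can be sketched and checked against a finite difference of the first-order derivative. The function names and the fractions for $C = 4$ categories below are hypothetical.

```python
import math


def pair_terms(obs, chance, gamma):
    """Yield (multiplier, component) for every pair of distances l < m."""
    n = len(obs)
    dists = range(1, n + 1)
    total = sum(s ** gamma * chance[s - 1] for s in dists)
    for m in range(2, n + 1):
        for l in range(1, m):
            component = (
                math.log(m / l)
                * (l ** gamma * chance[l - 1] / total)
                * (m ** gamma * chance[m - 1] / total)
                * (obs[l - 1] / chance[l - 1] - obs[m - 1] / chance[m - 1])
            )
            multiplier = sum(
                math.log(l * m / s ** 2) * s ** gamma * chance[s - 1] for s in dists
            ) / total
            yield multiplier, component


def first_derivative(obs, chance, gamma):
    """Sum of the pairwise components."""
    return sum(comp for _, comp in pair_terms(obs, chance, gamma))


def second_derivative(obs, chance, gamma):
    """Sum of the pairwise components, each scaled by its multiplier."""
    return sum(mult * comp for mult, comp in pair_terms(obs, chance, gamma))


# Hypothetical fractions for C = 4 categories (distances 1, 2, 3).
obs = [0.20, 0.08, 0.02]
chance = [0.30, 0.25, 0.10]
step = 1e-5
numeric = (first_derivative(obs, chance, 1.0 + step) - first_derivative(obs, chance, 1.0 - step)) / (2 * step)
print(second_derivative(obs, chance, 1.0), numeric)
```

Sharing one generator for both derivatives mirrors the structure of the formulas: the components are identical and only the multiplier differs.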

Conditions for Change in Coefficient Susceptibility
A necessary and sufficient condition follows from (13) and (14) in Theorems 1 and 2:

Corollary 7. As power parameter $\gamma$ in weighting scheme (2) increases, the susceptibility of coefficient $I_w$ in (12) to $\gamma$ intensifies if the first-order derivative $dI_w/d\gamma$ in (13) and the second-order derivative $d^{2}I_w/(d\gamma)^{2}$ in (14) have the same sign and weakens if these two derivatives have opposite signs. Equivalently, as $\gamma$ increases, coefficient susceptibility intensifies if

$$\sum_{m=2}^{C-1}\sum_{l=1}^{m-1}\ln\!\left(\frac{m}{l}\right) l^{\gamma}m^{\gamma}\,E_{(l)}E_{(m)}\left(\frac{O_{(l)}}{E_{(l)}} - \frac{O_{(m)}}{E_{(m)}}\right)$$

and

$$\sum_{m=2}^{C-1}\sum_{l=1}^{m-1}\left\{\sum_{s=1}^{C-1}\ln(lm/s^{2})\,s^{\gamma}E_{(s)}\right\}\ln\!\left(\frac{m}{l}\right) l^{\gamma}m^{\gamma}\,E_{(l)}E_{(m)}\left(\frac{O_{(l)}}{E_{(l)}} - \frac{O_{(m)}}{E_{(m)}}\right)$$

have the same sign and weakens if these two expressions have opposite signs.

Proof. This follows from the definitions of first-order and second-order derivatives. We obtain the two expressions in the second half of Corollary 7 by ignoring the $\sum_{s=1}^{C-1}s^{\gamma}E_{(s)}$ terms that are always positive and do not determine the signs of $dI_w/d\gamma$ in (13) and $d^{2}I_w/(d\gamma)^{2}$ in (14).

We expect that coefficient susceptibility often weakens as power parameter $\gamma$ increases. The reason is that, especially for high $\gamma$, large category distances $s$ correspond to both high values of $s^{\gamma}$ and negative values of $\ln(lm/s^{2})$ in $\sum_{s=1}^{C-1}\ln(lm/s^{2})\,s^{\gamma}E_{(s)}$, whereas small category distances $s$ correspond to both low values of $s^{\gamma}$ and positive values of $\ln(lm/s^{2})$. Thus, large distances $s$ tend to make large negative contributions to $\sum_{s=1}^{C-1}\ln(lm/s^{2})\,s^{\gamma}E_{(s)}$, whereas small distances $s$ tend to make only small positive contributions, triggering opposite signs in Corollary 7. As this mechanism for weakening coefficient susceptibility becomes increasingly strong for higher values of the power parameter, coefficient susceptibility ultimately converges to zero. Furthermore, the mechanism is more prominent if the values of $E_{(s)}$ remain substantial for large $s$, so large category distances are relatively likely to occur by chance.
For $C = 3$ categories (with the only feasible combination being $l = 1$ and $m = 2$), we can write the second-order derivative in (14) as a multiple of the first-order derivative in (13):

$$\frac{d^{2}I_w}{(d\gamma)^{2}} = \frac{\ln(2)\left(E_{(1)} - 2^{\gamma}E_{(2)}\right)}{E_{(1)} + 2^{\gamma}E_{(2)}}\,\frac{dI_w}{d\gamma}, \tag{15}$$

where the fraction in (15) is the multiplier in (14) for $l = 1$ and $m = 2$. Equivalently,

$$\frac{d^{2}I_w/(d\gamma)^{2}}{dI_w/d\gamma} = \frac{\ln(2)\left(E_{(1)} - 2^{\gamma}E_{(2)}\right)}{E_{(1)} + 2^{\gamma}E_{(2)}}. \tag{16}$$

We obtain the following result for three-category settings, where the relationship between the power parameter and the coefficient value is monotonic due to Corollary 2:

Corollary 8. As power parameter $\gamma$ in weighting scheme (2) increases in settings with three categories, the chance-corrected weighted agreement coefficient $I_w$ in (12) becomes more susceptible to $\gamma$ until $\gamma^{*} = \ln(E_{(1)}/E_{(2)})/\ln(2)$. Next, $I_w$ becomes less susceptible.
Thus, if $C = 3$, there is a value of the power parameter for which the chance-corrected weighted agreement coefficient $I_w$ is most susceptible to this parameter, and this value $\gamma^{*}$ is easy to compute. Furthermore, the weighted S-coefficient is most susceptible to linear weights:

Corollary 9. As power parameter $\gamma$ in weighting scheme (2) increases in settings with three categories, the weighted S-coefficient becomes more susceptible to $\gamma$ until $\gamma^{*} = 1$. Next, the weighted S-coefficient becomes less susceptible.
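For three categories, the turning point of susceptibility depends only on the two expected-by-chance fractions. The sketch below computes it and evaluates the three-category curvature ratio, which is the multiplier for $l = 1$ and $m = 2$; the function names are hypothetical, and the example uses the S-coefficient values $E_{(1)} = 4/9$ and $E_{(2)} = 2/9$ that follow from $E_{(l)} = 2(C-l)/C^{2}$ with $C = 3$.

```python
import math


def gamma_star(e1, e2):
    """Power parameter at which susceptibility peaks when C = 3."""
    return math.log(e1 / e2) / math.log(2.0)


def curvature_ratio(e1, e2, gamma):
    """Ratio of second-order to first-order derivative when C = 3."""
    return math.log(2.0) * (e1 - 2.0 ** gamma * e2) / (e1 + 2.0 ** gamma * e2)


# Weighted S-coefficient with three categories: E_(1) = 4/9 and E_(2) = 2/9.
e1, e2 = 4.0 / 9.0, 2.0 / 9.0
print(gamma_star(e1, e2))
print(curvature_ratio(e1, e2, 0.5), curvature_ratio(e1, e2, 2.0))
```

The curvature ratio is positive below the turning point and negative above it, which matches the statement that susceptibility first intensifies and then weakens.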
The contour plot in Figure 1 [...] the figure, triggering high $E_{(s)}$ for large $s$ in Corollary 7). Conversely, coefficient susceptibility intensifies monotonically at least up to quadratic weights (i.e., $\gamma^{*} > 2$) if the distribution is strongly unimodal, entailing a middle category that is substantially larger than the smallest corner category (in the top, left, and right parts of the figure, triggering low $E_{(s)}$ for large $s$ in Corollary 7). Furthermore, coefficient susceptibility is most extreme between identity and quadratic weights (i.e., $0 < \gamma^{*} < 2$) if the category proportions $p_1$, $p_2$, and $p_3$ are somewhat balanced. We note that small changes in the distribution of category proportions may substantially affect the value of $\gamma^{*}$ if one of the corner categories strongly dominates (in the figure's bottom left and right parts).

Descriptive Measures of Coefficient Susceptibility
Based on the preceding analysis, we propose descriptive measures that summarize various aspects of coefficient susceptibility for any data set with rater-based classifications. As shorthand notation, we denote the first-order derivative by $D_1(\gamma)$ and the second-order derivative by $D_2(\gamma)$.

First, researchers may use the first-order derivative $D_1(\gamma)$ to describe how the coefficient value reacts to changes in the chosen value of power parameter $\gamma$. The sign of $D_1(\gamma)$ reveals the direction of dependence; the absolute value quantifies the degree of coefficient susceptibility. The measure $D_1(\gamma)$ is the change in the value of coefficient $I_w$ in response to a small change in $\gamma$, expressed as a multiple of this change in $\gamma$. This measure of coefficient susceptibility is invariant to the amount of curvature that is present in the relationship between $\gamma$ and $I_w$. Interpretation is most straightforward for settings in which the relationship between $\gamma$ and $I_w$ is (almost) linear. For example, $D_1(\gamma) = .10$ would mean that the value of $I_w$ changes by (approximately) .10 if $\gamma$ changes by one point. Settings with substantial curvature in the relationship between $\gamma$ and $I_w$ require combining $D_1(\gamma)$ with a measure of curvature, that is, change in coefficient susceptibility as $\gamma$ changes (as discussed below). In settings with three categories, researchers may also report $D_1(\gamma^{*})$, that is, the first-order derivative evaluated at the value of the power parameter for which coefficient $I_w$ is most susceptible, where Corollary 8 defines $\gamma^{*}$. This measure provides a tight upper bound for the degree of coefficient susceptibility over the entire range of $\gamma$; it is independent of the chosen value of the power parameter.

Second, researchers may use the ratio $D_2(\gamma)/D_1(\gamma)$ to describe the amount of curvature that is present in the relationship between power parameter $\gamma$ and coefficient $I_w$, or equivalently, to describe the change in coefficient susceptibility as $\gamma$ changes (Pratt 1964). A positive sign of $D_2(\gamma)/D_1(\gamma)$ indicates that the coefficient value changes more when $\gamma$ increases than when $\gamma$ decreases (i.e., susceptibility intensifies as $\gamma$ increases), whereas a negative sign indicates the opposite (i.e., susceptibility weakens as $\gamma$ increases). The absolute value of $D_2(\gamma)/D_1(\gamma)$ quantifies the change in coefficient susceptibility as a fraction of the amount of susceptibility that is present. Thus, $D_2(\gamma)/D_1(\gamma)$ is a scaled measure that is invariant to the actual degree of susceptibility. For settings with three categories, this ratio measure reduces to (16), a simple closed-form expression. We note that $D_1(\gamma)$ and $D_2(\gamma)/D_1(\gamma)$ complement each other: The former describes coefficient susceptibility independent of the amount of curvature in the relationship between $\gamma$ and $I_w$; the latter describes curvature (or change in susceptibility) independent of the amount of susceptibility.

We illustrate the measures $D_1(\gamma)$ and $D_2(\gamma)/D_1(\gamma)$ for linear weights by considering 31 data sets from the literature. Contingent on our library access, these data sets originate from two literature reviews by Warrens (2013) and Warrens (2014), supplemented by other data sets that we obtained by checking lists of references and additional well-known studies of interrater agreement. In addition to $D_1(\gamma)$ and $D_2(\gamma)/D_1(\gamma)$, we show $D_1(\gamma^{*})$ and the corresponding value of $\gamma^{*}$ for all data sets with $C = 3$ categories. For settings with $R = 2$ raters, we implement Cohen's kappa, which is the most frequently used coefficient in practice. For settings with $R \geq 3$ raters, we implement Fleiss' kappa, proposed in the literature as an easy generalization of Cohen's kappa beyond two raters (although it generalizes Scott's pi rather than Cohen's kappa).
We provide Ox source code as online supplementary material on the journal's website. Ox is free of charge for academics, and downloads are available at doornik.com/download.html (Doornik 2007). The default value of the power parameter in the Ox source code is $\gamma = 1$, but users can easily adjust its value. To compute the measures $D_1(\gamma)$, $D_2(\gamma)/D_1(\gamma)$, $D_1(\gamma^{*})$, and $\gamma^{*}$, we recommend editing the first data set in the Ox source code if users wish to implement Cohen's kappa to analyze a data set in $C \times C$ contingency table format. Alternatively, we recommend editing the last data set in the Ox source code if the data set is an $N \times C$ table containing the rater frequencies $R_{i,c}$. In the latter case, the calculations assume that the coefficient is Fleiss' kappa, which coincides with Scott's pi if there are $R = 2$ raters. Users can run the source code in the Ox editor by first clicking on "Modules" and next clicking on "Ox." This automatically prints all computed statistics.

Table 2. Measures of coefficient susceptibility and their interpretation for 31 data sets from literature.

R. VAN OEST 571

Table 2 shows that $D_1(\gamma) > 0$ for 27 out of 31 data sets, confirming that coefficient values usually increase as the power parameter of the weighting scheme increases. Furthermore, the degree of coefficient susceptibility is often high: $|D_1(\gamma)| \geq .10$ for 17 data sets, $.05 \leq |D_1(\gamma)| < .10$ for 7 data sets, and $|D_1(\gamma)| < .05$ for only 7 data sets. For the data sets with $C = 3$ categories (implying monotonicity), the values of $\gamma^{*}$ vary substantially, confirming that the point until which coefficient susceptibility intensifies depends on the specific data set. As anticipated, the number of categories $C$ correlates strongly with the degree of coefficient susceptibility $|D_1(\gamma)|$, with a correlation coefficient of .59. Furthermore, the values of $D_2(\gamma)/D_1(\gamma)$ show that coefficient susceptibility often weakens as the power parameter increases (18 data sets), although it sometimes intensifies (5 data sets) or is almost constant (8 data sets). As anticipated, substantial change in coefficient susceptibility (i.e., curvature in the relationship between $\gamma$ and $I_w$) occurs most often when the number of categories $C$ is high, with a correlation coefficient of .48.

Coefficient Values for Unreported Values of Power Parameter
Beyond interpretation of coefficient susceptibility in terms of positive or negative and intensifying or weakening, the summary measures $D_1(\gamma)$ and $D_2(\gamma)/D_1(\gamma)$ help researchers obtain (approximate) coefficient values for unreported values of the power parameter. For example, Table 2 shows the results for linear weights only. Still, we can use these results to estimate the coefficient values for other choices, such as identity, radical, or quadratic weights. Furthermore, these estimates do not require access to the original data because the coefficient $I_w$ computed at $\gamma$, the first-order derivative $D_1(\gamma)$, and the ratio $D_2(\gamma)/D_1(\gamma)$ are sufficient statistics. The second-order Taylor series (i.e., quadratic) approximation of coefficient $I_w$, computed for an alternative power parameter value $\gamma + \Delta\gamma$, is given by

$$I_w(\gamma + \Delta\gamma) \approx I_w(\gamma) + D_1(\gamma)\,\Delta\gamma + \frac{1}{2}\,\frac{D_2(\gamma)}{D_1(\gamma)}\,D_1(\gamma)\,(\Delta\gamma)^{2}, \tag{17}$$

where the right-hand side is the heuristic value, and $\Delta\gamma$ is the change in the power parameter.

Table 3 shows the deviations between the actual value of $I_w(\gamma + \Delta\gamma)$ and the corresponding heuristic value in (17) for all 31 data sets in Table 2. We consider $\gamma = 1$ and $\gamma = 2$ for the original value of the power parameter (i.e., linear and quadratic weights). Next, we change the value of $\gamma$: $\Delta\gamma = -1$, $\Delta\gamma = -.5$, $\Delta\gamma = .5$, and $\Delta\gamma = 1$, resulting in $2 \times 4 = 8$ different scenarios. The heuristic is generally accurate. The mean absolute deviation based on all $31 \times 8$ cells in Table 3 is .002, and the corresponding mean absolute percent deviation is .99, approximately one percent. More precisely, the mean absolute deviation across the 31 data sets is .005 or less for each of the eight scenarios, and this deviation is .001 or less for the four scenarios with either $\Delta\gamma = -.5$ or $\Delta\gamma = .5$. Similarly, the maximum absolute deviation across 30 of the 31 data sets is .011 or less for each of the eight scenarios.
Furthermore, the absolute deviation remains modest for the excluded data set from Maclure and Willett (1987), with $C = 12$ categories and extreme levels of susceptibility and curvature. This deviation is .019 for $\gamma = 1$ and $\Delta\gamma = 1$, .016 for $\gamma = 2$ and $\Delta\gamma = -1$, and .011 or less for the other six scenarios.

Table 3.

| Data set | R | C | Deviations for the eight scenarios |
|---|---|---|---|
| Cohen (1960) | 2 | 3 | .001, .000, −.000, −.001, .001, .000, −.000, −.000 |
| Fleiss (1971) | 2 | 3 | .001, .000, −.000, −.001, .000, .000, −.000, −.000 |
| Fleiss (1971) | 2 | 3 | −.002, −.000, .000, .002, −.001, −.000, .000, .001 |
| Guggenmoos-Holzmann and Vonk (1998) | 2 | 3 | .000, −.000, .000, .000, −.000, −.000, .000, .000 |
| Spitzer and Fleiss (1974) | 2 | 3 | −.001, −.000, .000, .000, .000, .000, −.000, −.001 |
| Sim and Wright (2005) | 2 | 3 | −.002, −.000, .000, .003, −.003, −.000, .000, .002 |
| Sim and Wright (2005) | 2 | 4 | … |

Note: Cohen's kappa if $R = 2$ raters, and Fleiss' kappa if $R \geq 3$ raters.
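The quadratic heuristic needs only three reported numbers: the coefficient value, the first-order derivative, and the ratio of the second-order to the first-order derivative. A minimal sketch follows; the function name `heuristic_value` and the reported values used in the example are hypothetical.

```python
def heuristic_value(coefficient_value, d1, curvature_ratio, delta_gamma):
    """Second-order Taylor estimate of the coefficient at gamma + delta_gamma."""
    # Recover the second-order derivative from the reported ratio D2 / D1.
    d2 = curvature_ratio * d1
    return coefficient_value + d1 * delta_gamma + 0.5 * d2 * delta_gamma ** 2


# Hypothetical reported values for linear weights (gamma = 1).
reported_value, reported_d1, reported_ratio = 0.50, 0.10, -0.20

# Estimates for radical (gamma = .5) and quadratic (gamma = 2) weights.
print(heuristic_value(reported_value, reported_d1, reported_ratio, -0.5))
print(heuristic_value(reported_value, reported_d1, reported_ratio, 1.0))
```

A negative curvature ratio pulls the estimate below the linear extrapolation in both directions, reflecting that susceptibility weakens as the power parameter increases.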

Example: Two Raters and Three Categories
We consider a contingency table from Cohen (1960) that corresponds to the first row of Table 2. Table 4 reproduces the observed and expected fractions of items for which the choices by the two raters result in the corresponding category combination. As there are three categories, the maximum possible category distance (i.e., distance to the main diagonal) is two. For Cohen's kappa with three categories, Corollaries 1, 2, 3, 7, and 8 apply (see Table 1).

Example: Four Raters and Five Categories
We consider a data set from Gwet (2014, p. 372), included in Table 2. As there are five categories, the maximum possible category distance is four. As we consider Fleiss' kappa with more than three categories, Corollaries 1, 3, and 7 apply (see Table 1). The top part of Table 5 shows the observed and expected fractions of cases (i.e., combinations of items and rater pairs) for the four distances l = 1, ..., 4, together with the corresponding ratios O(l)/E(l). Because O(l)/E(l) is decreasing in the category distance l, coefficient I_w increases monotonically as power parameter γ increases (Corollary 3).
We compute the differences (O(l)/E(l)) − (O(m)/E(m)) for all pairs of different category distances, m = 2, ..., 4, and l < m, resulting in six pairwise comparisons of distances. Next, we compute the components of the first-order derivative in (13) for linear weights and these six pairs of distances. For example, the first component, with l = 1 and m = 2, is .015. The first-order derivative is the sum of the six components: D_1(γ) = .015 + .009 + .009 + .048 + .079 + .002 = .163 (the components are shown rounded to three decimals). Thus, the relationship between the power parameter and the coefficient value is indeed positive (Corollary 1). As shown in the middle part of Table 5, the coefficient's strong susceptibility to the power parameter is mainly due to comparisons involving the maximum possible category distance m = 4 that never actually occurred in the data (i.e., O(4)/E(4) = .000). In particular, the comparisons of the two category distances l = 1 and l = 2 with distance m = 4 (implying large distance ratios m/l in (13)) contribute substantially to D_1(γ); together they account for close to 80 percent of the total (.048 + .079 = .127 out of .163). The remaining distance l = 3 is unlikely to occur by chance and therefore plays only a minor role: E(3) = .062. Using a similar decomposition for the second-order derivative in (14), we obtain D_2(γ)/D_1(γ) = −.234. As the two derivatives D_1(γ) and D_2(γ) have opposite signs, coefficient susceptibility weakens as the power parameter increases (Corollary 7).

Table 5 (top part). Observed and expected fractions of cases by category distance l.

| Distance l | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| O(l) | .155 | .253 | .011 | .000 |
| E(l) | .110 | .378 | .062 | .162 |
| O(l)/E(l) | 1.409 | .670 | .187 | .000 |
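The six components can be reproduced, up to rounding, from the observed and expected fractions in the top part of Table 5. The sketch below assumes that the coefficient has the ratio form I_w = 1 − Σ_l l^γ O(l) / Σ_l l^γ E(l), in which case differentiating with respect to γ gives one term per pair l < m that is proportional to ln(m/l). Equations (12) and (13) are not restated in this section, so this form, the function name, and the use of the rounded table entries as inputs are assumptions of the sketch, not the exact published expressions.

```python
import math

# Rounded fractions from the top part of Table 5 (distance l -> value).
O = {1: 0.155, 2: 0.253, 3: 0.011, 4: 0.000}
E = {1: 0.110, 2: 0.378, 3: 0.062, 4: 0.162}


def pairwise_components(obs, exp, gamma):
    """Components of dI_w/dgamma, one per pair of distances l < m.

    Assumes I_w = 1 - sum(l**gamma * O(l)) / sum(l**gamma * E(l)).
    The cross product O(l)E(m) - O(m)E(l) equals
    E(l)E(m) * (O(l)/E(l) - O(m)/E(m)) but avoids dividing by E.
    """
    denom = sum(l ** gamma * e for l, e in exp.items()) ** 2
    comps = {}
    for m in sorted(exp):
        for l in sorted(exp):
            if l < m:
                cross = obs[l] * exp[m] - obs[m] * exp[l]
                comps[(l, m)] = (
                    math.log(m / l) * (l * m) ** gamma * cross / denom
                )
    return comps


components = pairwise_components(O, E, gamma=1.0)  # linear weights
d1 = sum(components.values())
share_m4 = (components[(1, 4)] + components[(2, 4)]) / d1

for pair, value in components.items():
    print(pair, round(value, 3))
print("D1:", round(d1, 3), "share of (1,4) and (2,4):", round(share_m4, 2))
```

With the rounded inputs, the individual components can differ from the reported ones in the third decimal, while their sum stays close to the reported first-order derivative.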
Using (12) and Table 5, we compute the coefficient value for linear weights. As before, we use the second-order Taylor series heuristic in (17) to obtain estimates of I_w for identity and quadratic weights, based on the computed measures for linear weights. Gwet (2014, p. 150) reported that Fleiss' kappa with identity weights equals .410 for the considered data set. The heuristic in (17) yields essentially the same coefficient value when moving from linear to identity weights (i.e., γ = 1 and Δγ = −1). Similarly, Gwet (2014, p. 150) reported that Fleiss' kappa with quadratic weights equals .734. When moving from linear to quadratic weights (i.e., γ = 1 and Δγ = 1), the heuristic yields
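The comparison between direct computation and the heuristic can be sketched in Python from the rounded fractions in the top part of Table 5. The sketch assumes the ratio form I_w = 1 − Σ_l l^γ O(l) / Σ_l l^γ E(l) for the coefficient, since (12) is not restated here; the helper names are hypothetical. Under that assumption, the directly computed values and the second-order Taylor estimates both land within a few thousandths of the values .410 and .734 reported by Gwet (2014), with the remaining gap partly due to the rounded inputs.

```python
import math

# Rounded fractions from the top part of Table 5 (distance l -> value).
O = {1: 0.155, 2: 0.253, 3: 0.011, 4: 0.000}
E = {1: 0.110, 2: 0.378, 3: 0.062, 4: 0.162}


def weighted_sum(values, gamma, k=0):
    """k-th derivative w.r.t. gamma of sum(l**gamma * values[l])."""
    return sum(math.log(l) ** k * l ** gamma * v for l, v in values.items())


def coefficient(obs, exp, gamma):
    """Assumed ratio form of the chance-corrected weighted coefficient."""
    return 1.0 - weighted_sum(obs, gamma) / weighted_sum(exp, gamma)


def derivatives(obs, exp, gamma):
    """First and second derivative of the coefficient (quotient rule)."""
    n0, n1, n2 = (weighted_sum(obs, gamma, k) for k in range(3))
    e0, e1, e2 = (weighted_sum(exp, gamma, k) for k in range(3))
    first = n1 * e0 - n0 * e1
    d1 = -first / e0 ** 2
    d2 = -(n2 * e0 - n0 * e2) / e0 ** 2 + 2.0 * e1 * first / e0 ** 3
    return d1, d2


# Measures at linear weights (gamma = 1).
i_linear = coefficient(O, E, 1.0)
d1, d2 = derivatives(O, E, 1.0)

# Second-order Taylor estimates for identity and quadratic weights.
est_identity = i_linear + d1 * (-1.0) + 0.5 * d2 * (-1.0) ** 2
est_quadratic = i_linear + d1 * 1.0 + 0.5 * d2 * 1.0 ** 2

# Direct computation at gamma = 0 and gamma = 2 for comparison.
exact_identity = coefficient(O, E, 0.0)
exact_quadratic = coefficient(O, E, 2.0)

print("D1:", round(d1, 3), "D2/D1:", round(d2 / d1, 3))
print("identity: heuristic", round(est_identity, 3),
      "direct", round(exact_identity, 3))
print("quadratic: heuristic", round(est_quadratic, 3),
      "direct", round(exact_quadratic, 3))
```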

Discussion
A frequently expressed concern is that different weighting schemes to penalize rater disagreements may result in substantially different coefficient values and conclusions about whether the categorized data are reproducible (De Raadt et al. 2021). The present study considered how a power parameter, commonly applied to define weighting schemes, affects a broad class of chance-corrected weighted agreement coefficients. We allowed for a continuum of infinitely many weighting schemes: Researchers may decide to follow popular choices (e.g., linear, quadratic, or identity weights) or use some other value of the real-valued power parameter that would better fit their data context. For example, they may decide that chosen categories that are one step apart should receive a specific weight and adjust the power parameter value to obtain the corresponding weighting scheme.
The "optimal" weighting scheme depends on the specific study context (Cohen 1968; Gwet 2014). Linear weights are a natural choice when there are no obvious arguments to deviate from penalization in proportion to the distance of rater disagreement. However, stricter weighting schemes (e.g., radical weights) may be better if even relatively small disagreements are serious, and more lenient weighting schemes (e.g., quadratic weights) may be better if only rather large disagreements are problematic. Although researchers may choose a specific weighting scheme for good reasons related to their data context, the choice is subjective and likely prone to abuse. For example, empirical studies most commonly use lenient quadratic weights (Vanbelle 2016). However, these studies usually provide little or no justification for this choice (Crewson 2005). Therefore, it is important to understand how the values of chance-corrected weighted agreement coefficients respond to changes in the power parameter. Furthermore, empirical studies should become more transparent.
The present study addressed these issues. First, we obtained theoretical results that help understand when and why chance-corrected weighted agreement coefficients are susceptible to the power parameter and in which direction. We provided necessary and sufficient conditions for the coefficient value to increase or decrease and the relationship to intensify or weaken as the power parameter increases. Furthermore, we decomposed these conditions into components that pairwise compare different category distances based on the ratio of observed to expected-by-chance frequency. For example, a larger ratio for the smaller distance than the larger distance contributes to a positive relationship between the power parameter and the coefficient value. We showed that the relationship is monotonic if the number of categories equals three or the ratio of observed to expected-by-chance frequency is monotonic in the category distance.
Second, we provided closed-form expressions for the first-order and second-order derivatives of chance-corrected weighted agreement coefficients with respect to the power parameter. We proposed the first-order derivative and the ratio of both derivatives as measures to quantify coefficient susceptibility and change in susceptibility as the power parameter changes. These summary measures give researchers a quick impression of the amount and type of dependence, such as positive or negative susceptibility and intensifying or weakening patterns. For example, suppose coefficient susceptibility turns out to be only moderate. In that case, the authors could use the measures to show that the obtained coefficient value does not strongly depend on the chosen weighting scheme. We found that positive but weakening coefficient susceptibility is most common. Thus, the coefficient value usually increases as the power parameter increases but tends to become more stable for higher values of the power parameter. For example, moving from identity to linear weights (i.e., from γ = 0 to γ = 1) likely triggers a larger change in the coefficient value than an equal-sized step from linear to quadratic weights (i.e., from γ = 1 to γ = 2).
Third, we showed how other researchers could use the coefficient value and derivatives for the reported value of the power parameter to obtain quite accurate estimates of the coefficient value for unreported values of the power parameter. These calculations are quick and easy (e.g., in Microsoft Excel or using a hand calculator), and they do not require access to the original data set. This last property is especially valuable: empirical studies often do not show their underlying data, particularly in settings with more than two raters, where the data no longer fit within a simple contingency table. Ideally, authors of empirical studies provide both arguments to justify their chosen weighting scheme and the derivative-based measures, so that others can recompute the coefficient value for other choices of the power parameter.
The literature has proposed reference tables to interpret the values of chance-corrected (weighted) agreement coefficients in terms of high or low (Landis and Koch 1977). However, there is a broad consensus that more lenient weighting schemes require stricter thresholds, making such tables less useful (e.g., Warrens 2013). Although a lenient weighting scheme may not need stricter thresholds if it fits the specific data context, correction is necessary if solid arguments for such a weighting scheme are lacking. Unfortunately, the literature offers little or no guidance on which stricter thresholds are appropriate. Therefore, an alternative approach could be to apply the original thresholds to a recomputed coefficient value for a less lenient weighting scheme that the outside researcher considers more appropriate. Our proposed measures allow for recalculations that are usually accurate in the first two decimals. Furthermore, these measures help identify whether such a correction matters for the considered data set, which would be the case if the degree of coefficient susceptibility is high.
Although we considered a broad class of chance-corrected weighted agreement coefficients, future research could obtain the first-order and second-order derivatives and related conditions for coefficients with different structures. Examples include the weighted kappa for R ≥ 3 raters and Gwet's AC2. Furthermore, future research could extend the analysis to coefficient versions that allow for missing data, where raters may classify different subsets of items (Gwet 2014; Van Oest and Girard 2021). Another avenue for future research pertains to the drivers of coefficient susceptibility. For example, the present study found that agreement coefficients are often more susceptible to the power parameter in settings with more categories (i.e., higher C). However, other drivers may be present too. Relatedly, we considered 31 data sets from the literature. Future research could include more data sets to improve the representativeness of the sample and present meta-analytic generalizations.

Funding
Open access funding provided by Norwegian Business School.

Conflict of interest
The author has no conflicts of interest to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.