1 Introduction

1.1 Family of gamma, delta, and tau

For estimating the association between two ordinal-scaled variables, two approaches are usually used: one based on covariance and one based on probability. The covariance-based approach includes such widely used estimators as the product–moment correlation coefficient (PMC; Pearson 1896 onwards), originally meant for two observed continuous variables, and the polychoric correlation (RPC; Pearson 1900, 1913) for two unobservable latent variables. Within the probability-based approach, the most commonly used measures of association come from the family that includes Kendall’s tau-a and tau-b (Kendall 1938, 1948), Goodman–Kruskal gamma (G; Goodman and Kruskal 1954), and Somers’ delta (D; Somers 1962). Such rarely used estimators as Kim’s dy.x (Kim 1971) and Wilson’s e (1974) are also part of this family. The family is usually referred to as the tau family (e.g., Kendall 1948; Kendall and Gibbons 1990), the gamma family (e.g., Van der Ark and Van Aert 2015; Woods 2007), or the delta family (e.g., Newson 2006; Metsämuuronen 2020a, b). Kendall’s tau-a can be taken as the mother of the other estimators because, when there are no tied pairs between the variables, they all equal tau-a (see Kendall and Gibbons 1990; Newson 2006). This article studies, specifically, the characteristics of G and shows that, under certain conditions, G is a special case of D, although sometimes the opposite is suggested (e.g., Kvålseth 2017).

1.2 Some known characteristics of G within the measurement modelling settings

G and partial G (Goodman and Kruskal 1954; Davis 1967) are used, although apparently rarely, in measurement modelling settings (see, e.g., Forthmann et al 2020; Kreiner and Christensen 2009; Nielsen and Santiago 2020). However, G has some favourable characteristics related to these settings. Namely, in comparison with the more widely used PMC, G appears to be robust against many sources of so-called systematic mechanical error (SME) causing mechanical underestimation of association (Metsämuuronen 2021). SME is a new concept related to estimators of association; it refers to the fact that the estimates of association include error that is mechanical in nature and that occurs systematically, in varying quantity, in certain estimators of association. For example, while PMC is notably affected by such sources of SME as restriction of range in general (see the literature in, e.g., Mendoza and Mumford 1987; Sackett and Yang 2000; Sackett et al 2007 and simulations by Martin 1973, 1978; Olsson 1980), item difficulty, the number of categories in the item and in the score, and the distributions of the latent variables, G produces estimates that are SME-free in all of these conditions (see simulations in Metsämuuronen 2021). In practical terms, while PMC always underestimates the true association for mechanical reasons, G reflects the true association without this mechanical loss of information, regardless of the sources of SME mentioned above. Hence, G appears to be a surprisingly interesting coefficient in resisting SME in the estimation of association. However, although G is accurate in reflecting the true association between two variables, it has two opposite challenges: obvious underestimation when the number of categories in the variables gets high and possible inflation in the magnitude of the estimates. These are discussed below.

Because G is based on probability, its embedded linear nature, in comparison with estimators of a trigonometric nature such as PMC, makes G underestimate the association between an item and the score in an obvious manner (see Metsämuuronen 2021). The phenomenon is similar with Somers’ D (Metsämuuronen 2020b; Göktaş and İşçi 2011), and it can be explained by Greiner’s relation (Greiner 1909) discussed by Kendall (1948), Newson (2002), and Metsämuuronen (2020b, 2021). Greiner’s relation states that, with continuous variables X and Y, tau-a = G = D and, then, PMC between variables X and Y equals \(\rho_{XY} = \sin \left( {\tfrac{1}{2}\pi \times tau_{a} } \right) = \sin \left( {\tfrac{1}{2}\pi \times G} \right) = \sin \left( {\tfrac{1}{2}\pi \times D} \right)\). Consequently, with continuous variables, the values of \(\rho_{XY}\) of 0, \(\pm 1/\sqrt{2}\), and \(\pm 1\), as examples, correspond with the values of G and D of 0, \(\pm 1/2\), and \(\pm 1\), respectively. Then, except for the extreme values \(\pm 1\) and 0, the magnitude of the estimates by \(\rho_{XY}\) tends to be greater than that of the estimates by G = D. While it is known that D underestimates the association of an item and the score when the number of categories in the item exceeds three (Metsämuuronen 2020b), G seems to underestimate the association when the number of categories exceeds four (Metsämuuronen 2021).
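Greiner’s relation can be illustrated numerically. The following is a minimal Python sketch; the function name `greiner_rho` is an assumption introduced only for this illustration, and the sketch simply evaluates the sine transformation stated above for a few values of the rank-based coefficient.

```python
import math

def greiner_rho(coef):
    """Map tau-a (= G = D with continuous variables) to rho via Greiner's relation."""
    return math.sin(0.5 * math.pi * coef)

# Values discussed in the text plus two intermediate ones
for coef in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"G = D = {coef:.2f}  ->  rho = {greiner_rho(coef):.4f}")
```

For every value strictly between 0 and 1, the transformed value is larger than the input, which is the sense in which the linear coefficients give estimates of lower magnitude than \(\rho_{XY}\).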

Another discussed challenge in G is the possible inflation of its estimates. Kvålseth (2017) notes that the estimates by G “may be highly inflated making it incomparable with other measures such as the frequently used Kendall's tau-b” (p. 10582; see also Higham and Higham 2019; Masson and Rotello 2009). Other researchers (e.g., Freeman 1986; Gonzalez and Nelson 1996; Metsämuuronen 2021) propose that there is no inflation per se in G but, instead, a different logic of using tied pairs when computing probability. This matter is discussed later with formulae. Partly, the apparent inflation may be caused by the hidden directional nature of G discussed in this article.

Based on simulation results, Metsämuuronen (2021) has collected some advantages of G in measurement modelling settings in comparison with item–total correlation (Rit), item–rest correlation (Rir), polychoric correlation (RPC), and D. First, G reaches the extreme values −1, 0, and +1 accurately, while Rit and Rir cannot reach the extremes of correlation, and RPC can reach the extreme values only approximately. Because it is based on ranks, G is also more robust to extreme observations, nonlinearity, and difficulty levels of the item than Rit and Rir. Hence, with binary items, G tends to underestimate item discrimination power less than Rit and Rir do. G is also applicable and accurate with non-normal, sparse, or small datasets and crosstables, while the applicability and accuracy of Rit and Rir depend on the number of categories in the variables. Second, G (as well as RPC) is accurate in reflecting the latent perfect association between the item and the score, unlike Rit, Rir, and D; the latter behave unpredictably and underestimate the latent perfect association in an obvious manner. While both G and RPC accurately reflect the perfect latent association, the calculation of RPC requires complex procedures and specific software packages, while G is reasonably easy to calculate, even manually, in practical test settings. Also, while RPC refers to unknown, unreachable, and hypothetical variables that are difficult to use in further research, G utilizes the known composite of items and score. Many of these advantages are related to SME; in comparison with PMC, both G and RPC appeared to be resistant to many sources of SME (Metsämuuronen 2021). We may add here also the result from this article: G has a logical directional nature from the measurement modelling viewpoint; it indicates how well the latent trait (score) explains the responses in the test items. Newson (2002) also points out that the interpretation of G is straightforward, and it may be easier to interpret in words than PMC.

1.3 An empirical note on the identity of G and D

Traditionally, G is taken as a symmetric measure because it produces only one value (e.g., IBM 2017; Sheskin 2011; Sirkin 2006; Wholey et al. 2015), while D is unambiguously a directional measure producing three options: a symmetric estimate and two directional estimates where either of the variables is dependent and the other is independent. The latter directions are usually named “row dependent” and “column dependent” in the analysis of two-way contingency tables. Hence, G and D are, fundamentally, different estimators of association. However, it is easy to produce a pair of variables where the estimates by G and one of the directions of D are identical; the only requirement is that one of the variables does not have tied cases (see later Table 1).

An unpublished empirical note of the identity of G and D was made when reanalysing the published dataset by Metsämuuronen (2020a); the original analysis did not concern G. When reanalysing the dataset using G, with all variables, the estimates by G and a specific direction of D were identical. If the empirical dataset shows the identity, it can be derived also in an algebraic manner. This article shows this identity.

1.4 Research question

Knowing that, under certain conditions, G = D ≤ 1, a relevant question is which of the options of Somers’ D is G: “row-dependent”, “column-dependent”, or “symmetric”? In what follows, the forms of G and D are presented first. By comparing the formulae, it is also shown that, under certain conditions, both G and D are related to the Jonckheere–Terpstra test statistic. Then, algebraic reasons for the direction of G are discussed. Finally, numerical examples of G and different varieties of D are given using a simulation with real-world datasets.

2 Forms of G and D

2.1 Measurement model latent to gamma and delta

The basic results in the article are general and applicable to any two general variables with ordinal or interval scale and, then, g and X refer to the variable with the narrower and wider scale, respectively. However, the applications in the article are discussed within the measurement modelling settings where the variables (item g and score or measurement scale X) are dependent because both are related to the common latent trait (θ).

Assume that the observed values in g with r = 1, …, R and X with c = 1, …, C distinctive ordinal or interval categories, and R << C, share the common latent trait (θ). Hence, the higher the latent trait is, the more probable it is to reach a higher score (X) and, simultaneously, a higher value (or the correct answer) in a test item (g). The threshold values of θ for each category in g are denoted by \(\upsilon_{i}\) and for each category in X by \(\tau_{j}\). Then, g and X are related to θ so that the observed value of the item is g = xi, if \(\upsilon_{i - 1}\) ≤ θ < \(\upsilon_{i}\), i = 1, 2, …, R, and the observed value of the score is X = yj, if \(\tau_{j - 1}\) ≤ θ < \(\tau_{j}\), j = 1, 2, …, C, with \(\upsilon_{0} = \tau_{0} = - \infty\) and \(\upsilon_{R} = \tau_{C} = + \infty\). Figure 1 illustrates the model with a binary g (R = 2); nij refers to the number of cases in cell i, j.

Fig. 1 A latent variable \(\theta\) manifested in two ordinal variables g and X

2.2 Population form of γ

G estimates the probability γ that two randomly chosen cases have the same order in two variables (e.g., Van der Ark and Van Aert 2015). Let variables g and X be sampled jointly from a bivariate population with the joint distribution π. The joint probabilities are denoted by πrc. Let (x1, y1), (x2, y2), …, (xN, yN) be a set of observations of the joint random variables g and X. A pair of observations (xl, yl) and (xh, yh), where l < h, is concordant if the order of both elements agrees, that is, if xl < xh and yl < yh or xl > xh and yl > yh. Similarly, the pair is discordant when xl < xh and yl > yh or xl > xh and yl < yh. If xl = xh or yl = yh, the pair is tied; it is neither discordant nor concordant.
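The classification of pairs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `classify_pair` and `count_pairs` and the five observations are hypothetical, and each unordered pair is counted once.

```python
from itertools import combinations

def classify_pair(first, second):
    """Classify two observations (x, y) as concordant, discordant, or tied."""
    (xl, yl), (xh, yh) = first, second
    if xl == xh or yl == yh:
        return "tied"
    # Same ordering on both variables means concordant
    return "concordant" if (xl < xh) == (yl < yh) else "discordant"

def count_pairs(observations):
    """Count each unordered pair of observations once."""
    counts = {"concordant": 0, "discordant": 0, "tied": 0}
    for first, second in combinations(observations, 2):
        counts[classify_pair(first, second)] += 1
    return counts

# Hypothetical observations (item value, score)
obs = [(0, 1), (0, 2), (1, 2), (1, 3), (0, 3)]
print(count_pairs(obs))
```

With five observations there are ten unordered pairs; in this hypothetical set most of them are tied on one of the variables, which is typical when one variable has a narrow scale.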

The probability that two randomly chosen test-takers have the same order in both g and X is denoted by πP (from the traditional symbol for concordant pairs P) and the probability that two randomly chosen test-takers have a different order in g and X is denoted by πQ (from the traditional symbol for discordant pairs Q), and

$$ \pi_{P} = \sum\limits_{r = 1}^{R} {\sum\limits_{c = 1}^{C} {\pi_{rc} } } \left( {\sum\limits_{i > r} {\sum\limits_{j > c} {\pi_{ij} } } + \sum\limits_{i < r} {\sum\limits_{j < c} {\pi_{ij} } } } \right) $$
(1)

and

$$ \pi_{Q} = \sum\limits_{r = 1}^{R} {\sum\limits_{c = 1}^{C} {\pi_{rc} } } \left( {\sum\limits_{i > r} {\sum\limits_{j < c} {\pi_{ij} } } + \sum\limits_{i < r} {\sum\limits_{j > c} {\pi_{ij} } } } \right) $$
(2)

(Van der Ark and Van Aert 2015). Using these symbols, the latent γ is defined as

$$ \gamma = \frac{{\pi_{P} - \pi_{Q} }}{{\pi_{P} + \pi_{Q} }}. $$
(3)

2.3 Sample forms of G, D, and tau-b

The sample forms of G and D are usually expressed using the concepts of concordance (P; the number of pairs of observations into the same direction) and discordance (Q; the number of pairs into the opposite directions) observed in variables g and X. We define

$$ \begin{gathered} C_{ij} = \sum\limits_{h < i} {\sum\limits_{k < j} {n_{hk} } } + \sum\limits_{h > i} {\sum\limits_{k > j} {n_{hk} } } , \hfill \\ D_{ij} = \sum\limits_{h < i} {\sum\limits_{k > j} {n_{hk} } } + \sum\limits_{h > i} {\sum\limits_{k < j} {n_{hk} } } , \hfill \\ P = \sum\limits_{i,j} {n_{ij} C_{ij} } , \hfill \\ Q = \sum\limits_{i,j} {n_{ij} D_{ij} } , \hfill \\ \end{gathered} $$
(4)

where nij is the number of cases in cell ij of the two-way contingency table. Both P and Q include (the same) tied pairs; the numbers of these tied pairs are denoted by Tg and TX when g and X are considered, respectively. The number of all combinations of pairs in the direction “g given X” is

$$ D_{r} = D_{g} = N^{2} - \sum\limits_{i = 1}^{R} {\left( {n_{i}^{2} } \right)} = \left( {P + T_{g} } \right) + \left( {Q + T_{g} } \right) = P + Q + 2T_{g} $$
(5)

and for “X given g”,

$$ D_{c} = D_{X} = N^{2} - \sum\limits_{i = 1}^{C} {\left( {n_{i}^{2} } \right)} = \left( {P + T_{X} } \right) + \left( {Q + T_{X} } \right) = P + Q + 2T_{X} . $$
(6)
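Equations (4) to (6) can be sketched in Python as follows. This is a minimal illustration, not the article's own software: the function name `pair_counts` and the 2 × 4 table (g in rows, X in columns) are hypothetical and chosen only to show the bookkeeping.

```python
def pair_counts(table):
    """Return (P, Q, D_g, D_X) for a two-way table with g in rows and X in columns."""
    n_rows, n_cols = len(table), len(table[0])
    P = Q = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # C_ij: cases lower on both variables or higher on both (Eq. 4)
            c_ij = sum(table[h][k] for h in range(n_rows) for k in range(n_cols)
                       if (h < i and k < j) or (h > i and k > j))
            # D_ij: cases lower on one variable and higher on the other (Eq. 4)
            d_ij = sum(table[h][k] for h in range(n_rows) for k in range(n_cols)
                       if (h < i and k > j) or (h > i and k < j))
            P += table[i][j] * c_ij
            Q += table[i][j] * d_ij
    N = sum(sum(row) for row in table)
    D_g = N ** 2 - sum(sum(row) ** 2 for row in table)        # Eq. (5)
    D_X = N ** 2 - sum(sum(col) ** 2 for col in zip(*table))  # Eq. (6)
    return P, Q, D_g, D_X

table = [[3, 2, 1, 0],   # hypothetical counts for the lower category of g
         [1, 2, 2, 3]]   # hypothetical counts for the higher category of g
P, Q, D_g, D_X = pair_counts(table)
T_g = (D_g - P - Q) // 2   # tied pairs implied by Eq. (5)
T_X = (D_X - P - Q) // 2   # tied pairs implied by Eq. (6)
print(P, Q, D_g, D_X, T_g, T_X)
```

In this hypothetical table, T_g equals the number of pairs from different rows that share a column, and T_X equals the number of pairs from different columns that share a row, so the two tie counts generally differ.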

The quantities P and Q in Eq. (4) are double those usually seen in textbooks (e.g., Metsämuuronen 2017; Siegel and Castellan 1988). Although calculating P and Q in practical settings is easier when only half of the directions are considered and the result is then doubled (see later Table and related discussion), the notation in Eq. (4) makes it possible to estimate the asymptotic standard errors strictly (e.g., Agresti 2010; Goodman and Kruskal 1979; Metsämuuronen 2021; see also Appendix).

The sample form of G estimates the latent γ and proportions P − Q to those pairs for which the direction is known; hence, the tied pairs are excluded:

$$ G = \frac{P - Q}{{P + Q}}. $$
(7)

The asymptotic standard error (ASE) used when computing the confidence interval is

$$ ASE_{1} (G) = \frac{4}{{\left( {P + Q} \right)^{2} }}\sqrt {\sum\limits_{i,j} {n_{ij} \left( {QC_{ij} - PD_{ij} } \right)^{2} } } $$
(8)

and, under the hypotheses of independence used when computing the test statistics,

$$ ASE_{0} (G) = \frac{2}{{\left( {P + Q} \right)}}\sqrt {\sum\limits_{i,j} {n_{ij} \left( {C_{ij} - D_{ij} } \right)^{2} } - \frac{1}{N}\left( {P - Q} \right)^{2} } $$
(9)

(e.g., IBM 2017; Agresti 2010) where P, Q, Cij, and Dij are as defined in Eq. (4).
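A minimal Python sketch of Eqs. (7) to (9) follows. The function name `gamma_with_ase` and the 2 × 4 table are hypothetical and serve only as an illustration of how the cell-level quantities of Eq. (4) enter the standard errors.

```python
import math

def gamma_with_ase(table):
    """Return (G, ASE1, ASE0) following Eqs. (7)-(9); g in rows, X in columns."""
    n_rows, n_cols = len(table), len(table[0])
    cells = []  # one (n_ij, C_ij, D_ij) triple per cell
    for i in range(n_rows):
        for j in range(n_cols):
            c_ij = sum(table[h][k] for h in range(n_rows) for k in range(n_cols)
                       if (h < i and k < j) or (h > i and k > j))
            d_ij = sum(table[h][k] for h in range(n_rows) for k in range(n_cols)
                       if (h < i and k > j) or (h > i and k < j))
            cells.append((table[i][j], c_ij, d_ij))
    N = sum(n for n, _, _ in cells)
    P = sum(n * c for n, c, _ in cells)
    Q = sum(n * d for n, _, d in cells)
    G = (P - Q) / (P + Q)                                            # Eq. (7)
    ase1 = 4 / (P + Q) ** 2 * math.sqrt(
        sum(n * (Q * c - P * d) ** 2 for n, c, d in cells))          # Eq. (8)
    ase0 = 2 / (P + Q) * math.sqrt(
        sum(n * (c - d) ** 2 for n, c, d in cells) - (P - Q) ** 2 / N)  # Eq. (9)
    return G, ase1, ase0

# Hypothetical 2 x 4 table
table = [[3, 2, 1, 0],
         [1, 2, 2, 3]]
G, ase1, ase0 = gamma_with_ase(table)
print(round(G, 3), round(ase1, 3), round(ase0, 3))
```

ASE1 would be used for a confidence interval around G, and ASE0 for the test statistic G / ASE0 under independence.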

The sample form of D proportions P − Q to all possible pairs, including also the tied pairs related to g or X, depending on the direction. In the case that X explains the order in g, that is, “g given X”,

$$ D\left( {g\left| X \right.} \right) = \frac{P - Q}{{D_{g} }} = \frac{P - Q}{{P + Q + 2T_{g} }}, $$
(10)

and in the case that g explains the order in X, that is, “X given g”,

$$ D\left( {X\left| g \right.} \right) = \frac{P - Q}{{D_{X} }} = \frac{P - Q}{{P + Q + 2T_{X} }}, $$
(11)

and generally, Tg ≠ TX. The sample form of the symmetric form of D is

$$ D\left( {sym} \right) = \frac{P - Q}{{\tfrac{1}{2}\left( {D_{g} + D_{X} } \right)}} = \frac{P - Q}{{P + Q + T_{g} + T_{X} }}. $$
(12)

When computing the confidence intervals, the ASEs for \(D\left( {g\left| X \right.} \right)\) and \(D\left( {X\left| g \right.} \right)\) are

$$ {\text{ASE}}_{1} \left( {D\left( {g\left| X \right.} \right)} \right) = \frac{2}{{D_{g}^{2} }}\sqrt {\sum\limits_{i,j} {n_{ij} \left( {D_{g} \left( {C_{ij} - D_{ij} } \right) - \left( {P - Q} \right)\left( {N - n_{i} } \right)} \right)^{2} } } $$
(13)

and

$$ {\text{ASE}}_{1} \left( {D\left( {X\left| g \right.} \right)} \right) = \frac{2}{{D_{X}^{2} }}\sqrt {\sum\limits_{i,j} {n_{ij} \left( {D_{X} \left( {C_{ij} - D_{ij} } \right) - \left( {P - Q} \right)\left( {N - n_{j} } \right)} \right)^{2} } } . $$
(14)

The corresponding ASEs under the hypotheses of independence are

$$ {\text{ASE}}_{0} \left( {D\left( {g\left| X \right.} \right)} \right) = \frac{2}{{D_{g} }}\sqrt {\sum\limits_{i,j} {n_{ij} \left( {C_{ij} - D_{ij} } \right)^{2} - \frac{1}{N}\left( {P - Q} \right)^{2} } } $$
(15)

and

$$ {\text{ASE}}_{0} \left( {D\left( {X\left| g \right.} \right)} \right) = \frac{2}{{D_{X} }}\sqrt {\sum\limits_{i,j} {n_{ij} \left( {C_{ij} - D_{ij} } \right)^{2} - \frac{1}{N}\left( {P - Q} \right)^{2} } } . $$
(16)

The form of the standard error of D(sym) is notably more complicated (see, e.g., IBM 2017) and it is not relevant for the latter part of the article. Hence, it is omitted here.
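The directional and symmetric forms of D and the ASEs in Eqs. (10) to (16) can be sketched in Python as below. The function name `somers_d`, the dictionary keys, and the 2 × 4 table are assumptions made only for this illustration.

```python
import math

def somers_d(table):
    """Return D estimates and ASEs per Eqs. (10)-(16); g in rows, X in columns."""
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    N = sum(row_tot)
    cells = []  # (n_ij, C_ij - D_ij, n_i, n_j)
    for i in range(n_rows):
        for j in range(n_cols):
            c_ij = sum(table[h][k] for h in range(n_rows) for k in range(n_cols)
                       if (h < i and k < j) or (h > i and k > j))
            d_ij = sum(table[h][k] for h in range(n_rows) for k in range(n_cols)
                       if (h < i and k > j) or (h > i and k < j))
            cells.append((table[i][j], c_ij - d_ij, row_tot[i], col_tot[j]))
    pq = sum(n * diff for n, diff, _, _ in cells)              # P - Q
    D_g = N ** 2 - sum(t ** 2 for t in row_tot)                # Eq. (5)
    D_X = N ** 2 - sum(t ** 2 for t in col_tot)                # Eq. (6)
    root0 = math.sqrt(sum(n * diff ** 2 for n, diff, _, _ in cells) - pq ** 2 / N)
    return {
        "D(g|X)": pq / D_g,                                    # Eq. (10)
        "D(X|g)": pq / D_X,                                    # Eq. (11)
        "D(sym)": pq / (0.5 * (D_g + D_X)),                    # Eq. (12)
        "ASE1(g|X)": 2 / D_g ** 2 * math.sqrt(sum(
            n * (D_g * diff - pq * (N - ni)) ** 2 for n, diff, ni, _ in cells)),
        "ASE1(X|g)": 2 / D_X ** 2 * math.sqrt(sum(
            n * (D_X * diff - pq * (N - nj)) ** 2 for n, diff, _, nj in cells)),
        "ASE0(g|X)": 2 / D_g * root0,                          # Eq. (15)
        "ASE0(X|g)": 2 / D_X * root0,                          # Eq. (16)
    }

# Hypothetical 2 x 4 table
res = somers_d([[3, 2, 1, 0],
                [1, 2, 2, 3]])
for name, value in res.items():
    print(name, round(value, 4))
```

Because D_g or D_X appears both in the estimate and in the corresponding ASE0, the ratio estimate / ASE0 is the same in both directions, even though the estimates themselves differ.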

As a benchmark for G and D, the sample form of tau-b is

$$ \text{tau-b} = \frac{P - Q}{{\sqrt {D_{g} \times D_{X} } }} = \frac{P - Q}{{\sqrt {\left( {P + Q} \right)^{2} + 2\left( {P + Q} \right)\left( {T_{g} + T_{X} } \right) + 4\left( {T_{g} \times T_{X} } \right)} }}, $$
(17)

where we see that the lower magnitudes of the estimates by tau-b as well as \(D\left( {sym} \right)\) in comparison with G and directional Ds are expected because tau-b and \(D\left( {sym} \right)\) use the number of tied pairs in a rather extensive manner.

By comparing Eqs. (7), (10), (11), and (12), it is obvious that G gives a more liberal approximation of the probability than D does. Both ways of thinking about the probability make sense. On the one hand, the logic in D is solid from the viewpoint of classic probability: the favourable cases are proportioned to all cases (of pairs). On the other hand, the logic in G is the same as in the sign test (traced to Arbuthnott 1710; see Metsämuuronen 2017) and the Wilcoxon signed-rank test (Wilcoxon 1945), where the sample size (related to the pairs) is adjusted by omitting the pairs for which the direction is not known. Hence, “this property [to restrict the calculation only to untied pairs in G] is neither a flaw nor a weakness”, as pointed out by Freeman (1986, p. 63; see also Gonzalez and Nelson 1996; Metsämuuronen 2021).

Notably, the discussion of the tied pairs can be more elaborate than above. Gonzalez and Nelson (1996), for example, separate the variables A and B into predictor (p) and criterion (c) variables and, consequently, the tied pairs may be related to the predictor variable (Ap or Bp), with the number of paired cases designated as Tp, to the criterion variable (Ac or Bc), designated as Tc, or to both the predictor and the criterion variable, designated as Tpc. Using these symbols, according to Gonzalez and Nelson (see also Freeman 1986), Somers’ D \(= \left( {P - Q} \right) / \left( {P + Q + T_{c} } \right)\), Kim’s DX.g \(= \left( {P - Q} \right) / \left( {P + Q + T_{p} } \right)\), and Wilson’s e \(= \left( {P - Q} \right) / \left( {P + Q + T_{c} + T_{p} } \right)\) (see more estimators in Freeman 1986). Notably, Gonzalez and Nelson as well as Freeman simplify the set of estimators remarkably; factually, \(\left( {P - Q} \right) / \left( {P + Q + T_{c} } \right)\) = \(D\left( {X\left| g \right.} \right)\), Kim’s \(D_{X.g} = D\left( {g\left| X \right.} \right)\), and Wilson’s e = \(D\left( {sym} \right)\). The factual directionality in G shown below is not related to the position of the variables as a predictor or criterion variable but to the widths of the scales. Hence, this logic of notation by Freeman (1986) and Gonzalez and Nelson (1996) is not used in this article.

In what follows, the sample forms and the interpretation of G and D are discussed within the measurement modelling settings and their connection to Jonckheere–Terpstra test statistic and identity under certain conditions is noted.

3 Identity of G and D and their connection to Jonckheere–Terpstra test statistic

3.1 Jonckheere–Terpstra test statistic and rank‒polyserial correlation

Cureton’s rank‒biserial correlation coefficient (\(\rho_{RB}\); Cureton 1956; Wendt 1972) for the association between a binary item and ordinal score can be expressed using the Mann–Whitney U test statistic (Mann and Whitney 1947):

$$ \rho_{RB} = 2 \times \frac{{U_{gX}^{{{\text{Obs}}}} }}{{U_{gX}^{{{\text{Max}}}} }} - 1 = 2 \times \frac{{U_{1} }}{{n_{0} n_{1} }} - 1, $$
(18)

where \(U_{gX}^{{{\text{Obs}}}}\) is the observed U test value related to the higher of the subsamples (l = 0 and h = 1) in g, \(U_{gX}^{{{\text{Max}}}}\) is the theoretical maximum value of the U test, and n0 and n1 are the numbers of cases in the subsamples in g. \(U_{gX}^{Max} = n_{0} n_{1}\) implies the condition that all test-takers in the higher subsample h = 1 are ranked higher in X than the test-takers in the lower subsample l = 0.

Jonckheere–Terpstra test statistic (JT; Jonckheere 1954; Terpstra 1952) extends the directional U and its calculation procedure to polytomous cases (e.g., Metsämuuronen 2017; Siegel and Castellan 1988). Hence, logically, the following measure may be called rank–polyserial correlation (\(\rho_{RP}\)):

$$ \rho_{RP} = 2 \times \frac{{JT_{gX}^{{{\text{Obs}}}} }}{{JT_{gX}^{{{\text{Max}}}} }} - 1 = 2 \times \frac{JT}{{\sum\limits_{l < h}^{R} {n_{l} n_{h} } }} - 1, $$
(19)

where \(JT_{gX}^{Obs}\) and \(JT_{gX}^{Max}\) are the observed and maximal JT statistics. The characteristics of the measure are not discussed here, although we note that the JT statistic is embedded in the core of the measure. The core in \(\rho_{RP}\) is the probability measure \(JT / \sum\limits_{l < h}^{R} {n_{l} n_{h} }\), ranging from 0 to 1 and indicating the proportion of logically ordered observations in g after they are ordered by X. In Eq. (19), this measure is transformed, using a linear transformation of doubling and centring, to the same scale as the correlation, ranging from −1 to +1. With a binary g, \(\rho_{RB}\) is a special case of \(\rho_{RP}\).
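Equation (19) can be sketched in Python from raw data. The function names `jt_statistic` and `rank_polyserial` and the small datasets are hypothetical. The sketch also assumes that a pair tied in X contributes one half to JT, which is the usual Mann–Whitney convention and is an assumption of this illustration.

```python
from itertools import combinations

def jt_statistic(g, x):
    """Count pairs from different g categories whose order in x agrees with g.

    Assumption of this sketch: a pair tied in x contributes 1/2.
    """
    jt = 0.0
    for (ga, xa), (gb, xb) in combinations(zip(g, x), 2):
        if ga == gb:
            continue  # pairs within the same category of g are not compared
        low, high = (xa, xb) if ga < gb else (xb, xa)
        if high > low:
            jt += 1.0
        elif high == low:
            jt += 0.5
    return jt

def rank_polyserial(g, x):
    """Rank-polyserial correlation of Eq. (19): 2 * JT / JT_max - 1."""
    sizes = [g.count(value) for value in sorted(set(g))]
    jt_max = sum(a * b for a, b in combinations(sizes, 2))
    return 2 * jt_statistic(g, x) / jt_max - 1

x = [1, 2, 3, 4, 5, 6]                       # hypothetical scores
print(rank_polyserial([0, 0, 0, 1, 1, 1], x))  # perfectly ordered item
print(rank_polyserial([0, 0, 1, 0, 1, 1], x))  # one pair out of order
```

With a binary g the maximum is n0 × n1, so the same function also gives the rank–biserial coefficient of Eq. (18).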

3.2 Relation of D and JT

Consider the direction of conditions where “g given X”. Because of Eq. (5)

$$ P + Q = \left( {N^{2} - \sum\limits_{i = 1}^{R} {\left( {n_{i}^{2} } \right)} } \right) - 2T_{g} , $$
(20)

The number of cases in the subsamples related to g and X are \(n_{i}\) and \(n_{j}\), respectively, and, then

$$ N = \sum\limits_{i = 1}^{R} {n_{i} } = \sum\limits_{j = 1}^{C} {n_{j} } . $$
(21)

Because of (21), the element \(N^{2} - \sum\limits_{i = 1}^{R} {\left( {n_{i}^{2} } \right)}\) can be manipulated as follows:

$$ N^{2} - \sum\limits_{i = 1}^{R} {\left( {n_{i}^{2} } \right)} = \left( {\sum\limits_{i = 1}^{R} {n_{i} } } \right)^{2} - \sum\limits_{i = 1}^{R} {\left( {n_{i}^{2} } \right)} = \sum\limits_{i = 1}^{R} {\left( {n_{i}^{2} } \right)} + 2 \times \sum\limits_{l < h}^{R} {n_{l} n_{h} } - \sum\limits_{i = 1}^{R} {\left( {n_{i}^{2} } \right)} = 2\sum\limits_{l < h}^{R} {n_{l} n_{h} } . $$
(22)

Hence, because of (10) and (22), \(D\left( {\left. g \right|X} \right)\) can be rewritten as

$$ D\left( {g\left| X \right.} \right) = \frac{P - Q}{{N^{2} - \sum\limits_{i = 1}^{R} {\left( {n_{i}^{2} } \right)} }} = \frac{P - Q}{{2\sum\limits_{l < h}^{R} {n_{l} n_{h} } }}. $$
(23)

Because of Eqs. (20) and (22)

$$ Q = 2\sum\limits_{l < h}^{R} {n_{l} n_{h} } - \left( {P + 2T_{g} } \right). $$
(24)

Then, \(D\left( {\left. g \right|X} \right)\) can be rewritten as

$$ D\left( {\left. g \right|X} \right) = \frac{P - Q}{{2\sum\limits_{l < h}^{R} {n_{l} n_{h} } }} = \frac{{P - \left( {2\sum\limits_{l < h}^{R} {n_{l} n_{h} } - \left( {P + 2T_{g} } \right)} \right)}}{{2\sum\limits_{l < h}^{R} {n_{l} n_{h} } }} = \frac{{2\left( {P + T_{g} } \right) - 2\sum\limits_{l < h}^{R} {n_{l} n_{h} } }}{{2\sum\limits_{l < h}^{R} {n_{l} n_{h} } }} = \frac{{\left( {P + T_{g} } \right)}}{{\sum\limits_{l < h}^{R} {n_{l} n_{h} } }} - 1. $$
(25)

Because of the definition in Eq. (5), including both positive and negative direction, the element \(\left( {P + T_{g} } \right)\) is two times the number of pairs in one direction. Remembering that JT equals the number of pairs in the same order including only the positive elements,

$$ P + T_{g} = 2 \times JT. $$
(26)

Then, because of Eqs. (23), (25), (26), and (19), we note the identity of \(\rho_{RP}\) and D:

$$ D\left( {\left. g \right|X} \right) = \frac{P - Q}{{2\sum\limits_{l < h}^{R} {n_{l} n_{h} } }} = 2 \times \frac{JT}{{\sum\limits_{l < h}^{R} {n_{l} n_{h} } }} - 1 = \rho_{RP} , $$
(27)

that is, \(\rho_{RP}\) is a special case of Somers’ D directed so that “g given X”. Hence, in measurement modelling settings, Somers’ \(D\left( {\left. g \right|X} \right)\) strictly indicates the proportion of logically ordered test-takers in the item after they are ordered by the score.
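The relation between D and JT can be checked numerically. The sketch below expands a hypothetical 2 × 4 table into individual cases, counts the pairs directly, and compares \(D\left( {g\left| X \right.} \right)\) from Eq. (10) with \(\rho_{RP}\) from Eq. (19). The table and the convention that a pair tied in X adds one half to JT are assumptions of this illustration.

```python
from itertools import combinations

table = [[3, 2, 1, 0],   # hypothetical counts, lower category of g
         [1, 2, 2, 3]]   # hypothetical counts, higher category of g

# One (g, X) tuple per case
cases = [(i, j) for i, row in enumerate(table)
         for j, n in enumerate(row) for _ in range(n)]

conc = disc = tied_in_x = 0
for (ga, xa), (gb, xb) in combinations(cases, 2):
    if ga == gb:
        continue  # only pairs from different categories of g are relevant
    sign = (gb - ga) * (xb - xa)
    if sign > 0:
        conc += 1
    elif sign < 0:
        disc += 1
    else:
        tied_in_x += 1

P, Q, T_g = 2 * conc, 2 * disc, tied_in_x    # doubled counts, as in Eq. (4)
row_tot = [sum(row) for row in table]
jt_max = sum(a * b for a, b in combinations(row_tot, 2))
JT = conc + 0.5 * tied_in_x                  # assumed tie convention

D_g_given_X = (P - Q) / (P + Q + 2 * T_g)    # Eq. (10)
rho_RP = 2 * JT / jt_max - 1                 # Eq. (19)
print(P + T_g, 2 * JT, D_g_given_X, rho_RP)
```

Under these assumptions, P + T_g equals 2 × JT, as in Eq. (26), and the two coefficients coincide.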

3.3 Relation of G and JT

Because of Eqs. (7) and (20)

$$ G = \frac{P - Q}{{P + Q}} = \frac{P - Q}{{\left( {N^{2} - \sum\limits_{i = 1}^{R} {\left( {n_{i}^{2} } \right)} } \right) - 2T_{g} }} = \frac{P - Q}{{2\sum\limits_{l < h}^{R} {n_{l} n_{h} } - 2T_{g} }}. $$
(28)

Because of Eqs. (28) and (24), parallel to Eq. (25), we can write

$$ G = \frac{P - Q}{{2\sum\limits_{l < h}^{R} {n_{l} n_{h} } - 2T_{g} }} = \frac{{P - 2\sum\limits_{l < h}^{R} {n_{l} n_{h} } + P + 2T_{g} }}{{2\sum\limits_{l < h}^{R} {n_{l} n_{h} } - 2T_{g} }} = \frac{{P + T_{g} - \sum\limits_{l < h}^{R} {n_{l} n_{h} } }}{{\sum\limits_{l < h}^{R} {n_{l} n_{h} } - T_{g} }} = \frac{P}{{\sum\limits_{l < h}^{R} {n_{l} n_{h} } - T_{g} }} - 1 $$
(29)

and because of Eq. (26), that is, \(P = 2 \times JT - T_{g}\),

$$ G = \frac{P - Q}{{P + Q}} = \frac{{2 \times JT - T_{g} }}{{\sum\limits_{l < h}^{R} {n_{l} n_{h} } - T_{g} }} - 1. $$
(30)

This indicates that, in measurement modelling settings, Goodman–Kruskal gamma can be interpreted as a slightly modified proportion of the logically ordered test-takers in the item after they are ordered by the score, taking into account only those cases for which we know the order, that is, considering only the pairs without ties.

While the coefficient in Eq. (19) is called the rank–polyserial correlation coefficient, also the latter part in Eq. (30) could be used as \(\rho_{RP}\). However, the former estimator related to D (Eq. 27) gives a more conservative estimate while the latter related to G (Eq. 30) gives a more liberal estimate of the association between two ordinal-scaled variables.

3.4 Identity of G and D

Strictly from Eqs. (7), (10), (11), and (12), it is known that G = D when there are no tied pairs. Then, G has the identity of D under three general conditions irrespective of the distributions of the variables, the difficulty level of the variables, the number of cases, the number of categories in the variables, and the number of ties within the single variables: (1) when either of the variables is or both are continuous, implying no tied pairs; (2) if X is not continuous but there are no ties in X, that is, when each test-taker gets a unique score regardless of the distribution in the item; and (3) when there are ties in X but there are no crossing observations between g and X, that is, when all the tied values in the score are related to an identical value in the item. The last of the options appears to be important in understanding the direction in G.

From the direction of “g given X”, when \(T_{g}\) = 0, because Tg ≠ TX, and because of Eqs. (7) and (10),

$$ D\left( {g\left| X \right.} \right) = \frac{P - Q}{{P + Q + 2T_{g} }} = \frac{P - Q}{{P + Q}} = G. $$
(31)

Similarly, from the direction of “X given g”, when \(T_{X}\) = 0, because of Eqs. (7) and (11)

$$ D\left( {X\left| g \right.} \right) = \frac{P - Q}{{P + Q + 2T_{X} }} = \frac{P - Q}{{P + Q}} = G. $$
(32)

However, the condition TX = 0 is possible only in the case of continuous variables, causing Tg = TX = 0 and, then, \(G = D\left( {X\left| g \right.} \right) = D\left( {g\left| X \right.} \right) = D\left( {sym} \right)\). The reason is that, excluding the continuous case, the condition Tg = 0 or TX = 0 is true only when there are ties in the variables but there are no crossing observations between the two variables, that is, when all the tied values in one variable are related to an identical value in the other variable (see variables A2 and B2 in Table 1). This can happen only with the variable that has the wider scale because, in the variable with the shorter scale, there will always be at least two pairs that are tied with the variable with the wider scale. Hence, only the variable with the wider scale can be the one causing the condition of no ties (Tg = 0), irrespective of whether the variable is in the row or in the column or whether it is a predictor or criterion variable (see Gonzalez and Nelson 1996). Therefore, except for the case of continuous variables implying Tg = TX = 0, when \(T_{g}\) = 0 and \(T_{X}\) ≠ 0,

$$ \begin{gathered} G = \frac{P - Q}{{P + Q}} = \frac{P - Q}{{P + Q + 2T_{g} }} \\ = D\left( {g\left| X \right.} \right) = G\left( {g\left| X \right.} \right) \ne D\left( {X\left| g \right.} \right).\\ \end{gathered} $$
(33)

Equation (33) means that, although usually taken as a symmetric measure, Goodman–Kruskal gamma is, in fact, a directional measure in the same manner as Somers’ D; G is directed so that the order in the variable with the wider scale explains the order in the variable with the narrower scale, irrespective of their positions in the rows and columns of the cross-table. Numerical examples will clarify the phenomenon.
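A small numerical sketch illustrates the condition. The function name `g_and_d` and both datasets are hypothetical: in the first dataset all tied scores belong to the same item category (so that \(T_{g}\) = 0), and in the second one score is changed so that a tie crosses the item categories.

```python
from itertools import combinations

def g_and_d(g, x):
    """Return (G, D(g|X), D(X|g)) computed from raw item values g and scores x."""
    conc = disc = 0
    for (ga, xa), (gb, xb) in combinations(zip(g, x), 2):
        sign = (ga - gb) * (xa - xb)
        conc += sign > 0
        disc += sign < 0
    P, Q = 2 * conc, 2 * disc                              # doubled, as in Eq. (4)
    N = len(g)
    D_g = N ** 2 - sum(g.count(v) ** 2 for v in set(g))    # Eq. (5)
    D_X = N ** 2 - sum(x.count(v) ** 2 for v in set(x))    # Eq. (6)
    return (P - Q) / (P + Q), (P - Q) / D_g, (P - Q) / D_X

g = [0, 0, 0, 1, 0, 1, 1, 1]                 # hypothetical binary item

# Ties in the score occur only within the same item category: T_g = 0
x_no_crossing = [1, 1, 2, 3, 4, 5, 5, 6]
print(g_and_d(g, x_no_crossing))

# One tie in the score now crosses the item categories: T_g > 0
x_crossing = [1, 1, 2, 3, 3, 5, 5, 6]
print(g_and_d(g, x_crossing))
```

In the first case G coincides with \(D\left( {g\left| X \right.} \right)\) but not with \(D\left( {X\left| g \right.} \right)\); in the second case G exceeds both directional estimates.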

Notably, also, except for the case of continuous variables when TX = Tg = 0, the ASEs of G and D are equal only in one direction. This is easy to show for the ASEs under the hypothesis of independence. Because of Eqs. (5) and (6), \(D_{g} = P + Q + 2T_{g} \ne P + Q + 2T_{X} = D_{X}\). Because of Eqs. (9) and (15), knowing that TX = 0 can be obtained only with continuous variables without tied pairs, under any other condition when Tg = 0 \(\ne\) TX,

$$ \begin{gathered} ASE_{0} (G) = \frac{2}{{\left( {P + Q} \right)}}\sqrt {\sum\limits_{i,j} {n_{ij} \left( {C_{ij} - D_{ij} } \right)^{2} } - \frac{1}{N}\left( {P - Q} \right)^{2} } \\ = \frac{2}{{\left( {P + Q + 0} \right)}}\sqrt {\sum\limits_{i,j} {n_{ij} \left( {C_{ij} - D_{ij} } \right)^{2} } - \frac{1}{N}\left( {P - Q} \right)^{2} } \\ = ASE_{0} \left( {D\left( {g\left| X \right.} \right)} \right) \ne ASE_{0} \left( {D\left( {X\left| g \right.} \right)} \right). \\ \end{gathered} $$
(34)

Also, the empirical findings (see discussion with Table 2 below) suggest that under the same conditions as above, \(ASE_{1} (G) = ASE_{1} \left( {D\left( {g\left| X \right.} \right)} \right) \ne ASE_{1} \left( {D\left( {X\left| g \right.} \right)} \right)\) although showing this algebraically is not obvious; the formulae (8), (13), and (14) use different sources of information.

4 Numerical examples

4.1 A simple comparison

The estimates by G and D are compared first using a simple dataset with two sets of variables with a narrower scale (Table 1): a binary set (items A1, A2, A3) and a polytomous set (items B1, B2, B3). In both sets, one item follows a deterministic pattern without tied pairs and without stochastic error (A1 and B1), where we expect to see perfect item discrimination and D = G = 1; one item is without tied pairs and includes stochastic error (A2 and B2), where we expect to see G = D ≤ 1; and one item has tied pairs and includes stochastic error (A3 and B3), where we expect to see G > D. In all cases, the score with the wider scale includes a small number of tied cases, just to show their effect on the estimates. As an example, the estimates and statistics for variables g = A2 and X are illustrated in the form of a two-way contingency table (Table 2).

Table 1 Example of the estimates by G and D under different conditions; X in column
Table 2 Two-way contingency table for variables A2 and X related to Table 1

Given Table 2, the number of pairs in the same direction is \(P = 2 \times \left( {13 \times 5 + 2 \times 4} \right) = 146\) and the number of pairs in the opposite directions is \(Q = 2 \times \left( {13 \times 0 + 2 \times 1} \right) = 4\), and consequently, \(P - Q = 142\) and \(P + Q = 150\). The number of all pairs in the direction of “g given X” is \(D_{g} = 20^{2} - \left( {225 + 25} \right) = 150\) and the number of pairs in the direction of “X given g” is \(D_{X} = 20^{2} - \left( {1 + 1 + ... + 1} \right) = 400 - 32 = 368\). Then \(G = 142/150 = 0.947\), \(D\left( {g\left| X \right.} \right) = 142/150 = 0.947\), \(D\left( {X\left| g \right.} \right) = 142/368 = 0.386\), \(D\left( {sym} \right) = 142/\left[ {\tfrac{1}{2}\left( {150 + 368} \right)} \right] = 0.548\), and \(\text{tau-b} = 142/\sqrt {150 \times 368} = 0.604\).
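The arithmetic above can be reproduced from raw data by counting ordered pairs. The sketch below uses hypothetical 20-case data (an assumption, not Table 2 itself) constructed so that the ordering between the two categories of g matches the counts quoted above; the function name `pair_counts` is illustrative only.

```python
from itertools import permutations
from math import sqrt

# Hypothetical data (an assumption): 15 cases with g = 0 and 5 with g = 1;
# X has six tied pairs of cases, all within the same category of g.
g = [0] * 13 + [1] + [0] * 2 + [1] * 4
x = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14]


def pair_counts(g, x):
    """Count ordered pairs: concordant, discordant, untied on g, untied on X."""
    p = q = d_g = d_x = 0
    for i, j in permutations(range(len(g)), 2):
        dg, dx = g[i] - g[j], x[i] - x[j]
        if dg * dx > 0:
            p += 1          # same direction
        elif dg * dx < 0:
            q += 1          # opposite directions
        if dg != 0:
            d_g += 1        # equals N^2 - sum of squared g marginals
        if dx != 0:
            d_x += 1        # equals N^2 - sum of squared X marginals
    return p, q, d_g, d_x


P, Q, D_g, D_X = pair_counts(g, x)
G = (P - Q) / (P + Q)
D_gX = (P - Q) / D_g                    # D(g|X) in the article's notation
D_Xg = (P - Q) / D_X                    # D(X|g) in the article's notation
D_sym = (P - Q) / (0.5 * (D_g + D_X))
tau_b = (P - Q) / sqrt(D_g * D_X)
print(P, Q, D_g, D_X)
print(round(G, 3), round(D_gX, 3), round(D_Xg, 3), round(D_sym, 3), round(tau_b, 3))
```

Because no pair that differs on g is tied on X in these data, P + Q equals \(D_{g}\), which is why G and \(D\left( {g\left| X \right.} \right)\) coincide.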

The calculations of the ASEs and confidence intervals of G and D are presented in the Appendix. Given Table 2, \({\text{ASE}}_{1} (G) = 0.0592\) and \({\text{ASE}}_{0} (G) = 0.2515\). Then, the traditional asymptotic 95% confidence interval for the true γ is \(\gamma = 0.947 \pm 2.101 \times 0.0592 = [0.822, 1]\) (see Footnote 4) and the asymptotic significance, when testing the hypothesis γ = 0, is \(Z = 0.947/0.2515 = 3.764\), leading to p < 0.001. The corresponding ASEs of the directed Ds are \({\text{ASE}}_{1} \left( {D\left( {g\left| X \right.} \right)} \right) = 0.0592\), \({\text{ASE}}_{0} \left( {D\left( {g\left| X \right.} \right)} \right) = 0.2515\), \({\text{ASE}}_{1} \left( {D\left( {X\left| g \right.} \right)} \right) = 0.1003\), and \({\text{ASE}}_{0} \left( {D\left( {X\left| g \right.} \right)} \right) = 0.1025\) (see Appendix). Somers’ D estimates the true probability δ. Then, the traditional asymptotic 95% confidence intervals for δ are \(\delta \left( {g\left| X \right.} \right) = 0.947 \pm 2.101 \times 0.0592 = [0.822, 1]\) and \(\delta \left( {X\left| g \right.} \right) = 0.386 \pm 2.101 \times 0.1003 = [0.175, 0.597]\). When testing the hypothesis \(\delta \left( {g\left| X \right.} \right)\) = 0, \(Z = 0.947/0.2515 = 3.764\) with p < 0.001, and for \(\delta \left( {X\left| g \right.} \right)\) = 0, \(Z = 0.3859/0.1025 = 3.764\) with p < 0.001. We note that the test statistics and the statistical inference by \(D\) are identical to those by G.
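The interval and test arithmetic can be retraced from the reported quantities alone. The following sketch takes the ASE values and the critical value 2.101 as given in the text; the helper names `wald_interval` and `z_statistic` are illustrative, and truncating the interval at −1 as well as at +1 is an assumption made here for symmetry.

```python
def wald_interval(estimate, ase1, crit):
    """Symmetric interval estimate +/- crit * ASE1, truncated to [-1, 1]."""
    half_width = crit * ase1
    return max(-1.0, estimate - half_width), min(1.0, estimate + half_width)


def z_statistic(estimate, ase0):
    """Test statistic for the hypothesis that the true coefficient is zero."""
    return estimate / ase0


G = 142 / 150          # also D(g|X) for these data
D_Xg = 142 / 368       # D(X|g)
CRIT = 2.101           # critical value as used in the text

ci_G = wald_interval(G, 0.0592, CRIT)
ci_D_Xg = wald_interval(D_Xg, 0.1003, CRIT)
z_G = z_statistic(G, 0.2515)
z_D_Xg = z_statistic(D_Xg, 0.1025)

print([round(v, 3) for v in ci_G], [round(v, 3) for v in ci_D_Xg])
print(round(z_G, 2), round(z_D_Xg, 2))
```

The two Z statistics agree because the numerator and ASE0 of D(X|g) are both those of G scaled by the same ratio of denominators, 150/368.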

Some observations from Tables 1 and 2 are highlighted. First, we note the relevant direction of association discussed in Footnote 2. G (“g given X”) and D (“g given X”) are the estimators that detect the deterministic pattern of item discrimination in items A1 and B1. This was expected because of Eqs. (19), (27), and (30) related to the JT statistic; in deterministic patterns such as A1 and B1, \(JT_{gX}^{Obs} = JT_{gX}^{Max}\) and, consequently, G = D = 1. Second, with items A1 and A2 as well as B1 and B2, \(G = D\left( {g\left| X \right.} \right) \ne D\left( {X\left| g \right.} \right) \ne D\left( {Sym} \right)\) because there are no tied pairs related to X and, then, \(P + Q = N^{2} - \sum\limits_{i = 1}^{R} {\left( {n_{i}^{2} } \right)} \ne N^{2} - \sum\limits_{j = 1}^{C} {\left( {n_{j}^{2} } \right)}\). This was expected because of Eq. (33). Third, when there are tied pairs (A3 and B3), \(G > D\left( {g\left| X \right.} \right)\) because \(P + Q < P + Q + 2T_{g}\). This is expected because of Eqs. (7) and (10). Fourth, when there are no tied pairs (A1, A2, B1, B2), \(ASE_{1} (G) = ASE_{1} \left( {D\left( {g\left| X \right.} \right)} \right) \ne ASE_{1} \left( {D\left( {X\left| g \right.} \right)} \right)\) and \(ASE_{0} (G) = ASE_{0} \left( {D\left( {g\left| X \right.} \right)} \right) \ne ASE_{0} \left( {D\left( {X\left| g \right.} \right)} \right)\). For \(ASE_{0}\), this is expected because of Eq. (34); for \(ASE_{1}\), it is an empirical observation that was not shown algebraically (see the discussion following Eq. 34).

To verify the result, we could re-analyse the dataset by pivoting the cross-tables such that X is the row factor and g is the column factor. We would see that, in items A1, A2, B1, and B2, still \(G = D\left( {g\left| X \right.} \right) \ne D\left( {X\left| g \right.} \right)\), that is, the direction of G does not depend on the assignment of rows and columns.

4.2 A comparison of G and D with a larger dataset

In the second comparison of the estimates by G and D, a wider survey of the behaviour of G and the different variants of D was conducted. In the comparison, 13,392 test items from 1,292 tests were formed by different combinations of single items and sub-scores constructed by different item compilations based on randomly selected test-takers from a national-level dataset of 4,000 test-takers of a mathematics test for grade 9 with 30 binary items (FINEEC 2018). In the original dataset, the item discrimination ranged \(0.332 < PMC = \rho_{gX} = Rit < 0.627\) with the average \(\overline{Rit} = 0.481\), the difficulty levels of the items ranged 0.24 < p < 0.95 with the average difficulty level of \(\overline{p}\) = 0.63, and the lower bound of reliability was α = 0.885. A small number of artificial datasets (13% of tests) were constructed to cover the very difficult and extremely difficult tests. Finally, a set of 1,292 mostly real-world datasets with different numbers of test-takers (N = 50, 100, and 200), test lengths (k = 2–30), difficulty levels (\(\overline{p}\) = 0.08–0.96), reliabilities (\(\alpha\) = 0.74–0.98), and degrees of freedom in the item (df(g) = 1–15) and in the score (df(X) = 12–27), comprising 13,392 partly related test items, was formed to compare the estimates by G and D. The average estimates are collected in Table 3 and Fig. 2. The main outcome of the survey is that G follows the trend of \(D\left( {g\left| X \right.} \right)\) and not that of \(D\left( {X\left| g \right.} \right)\) or the symmetric D (Fig. 2). Using the same naming logic as with D, G is, in fact, \(G\left( {g\left| X \right.} \right)\).

Table 3 Average estimates of G and D based on the real-world datasets
Fig. 2 Comparison of G and D by the degrees of freedom of the item; df(g) = R − 1; the category df(g) = 13 combines df(g) = 13–15; k = 13,392 items

5 Conclusions and possibilities of G in the measurement modelling settings

5.1 General notes on the results

The main result is that, although Goodman–Kruskal gamma is usually taken as a symmetric measure, it is, in fact, a directional measure in the same manner as Somers’ D. The direction in G is not determined by rows and columns; instead, G is directed so that the order in the variable with the narrower scale depends on the order in the variable with the wider scale in the analysis. This direction makes sense in measurement modelling settings, where it is assumed that the latent trait, manifested as the score or the measurement scale with the wider scale, explains the response pattern in the item with the narrower scale (see, e.g., Kim 1971; Byrne 2001; Metsämuuronen 2017). This directional nature of G may partly explain the potential “inflation” discussed by, for example, Higham and Higham (2019), Kvålseth (2017), and Masson and Rotello (2009).

That G is a directional measure is somewhat alarming from the viewpoint of using it in general settings. It is strongly recommended that gamma not be used as a symmetric measure and that it be used directionally only when the aim is to explain the response pattern in a variable with a narrower scale by a variable with a wider scale.

5.2 Possibilities of G in the measurement modelling settings

That G is not related to rows and columns but to the widths of the scales is a positive matter within measurement modelling settings: G leads strictly to the direction that is logical from the viewpoint of the theory, where the latent trait, manifested as the score or the measurement scale, drives the responses in the test item. Hence, G could be an asset in measurement modelling settings. While \(D\left( {g\left| X \right.} \right)\) has been raised as one of the “superior alternatives” to PMC in the binary case (Metsämuuronen 2020a), G would be an even “more superior alternative” than D (Metsämuuronen 2021). After all, while D tends to underestimate the association between an item and the score in an obvious manner when the item has three categories or more (see Göktaş and İşçi 2011; Metsämuuronen 2020a), G does not underestimate IDP to that extent (Metsämuuronen 2021).

Being a directional measure and one of the “superior alternatives” to PMC, G has direct relevance to the new concept of “SME-corrected” estimates of reliability proposed by Metsämuuronen (2021). Namely, it is known that coefficient alpha, a classical estimator of the lower bound of reliability, can be expressed using the item–total correlation (PMC = \(\rho_{gX}\)):

$$ \alpha = \frac{k}{k - 1}\left( {1 - \frac{{\sum\limits_{g = 1}^{k} {\sigma_{g}^{2} } }}{{\left( {\sum\limits_{g = 1}^{k} {\sigma_{g} \rho_{gX} } } \right)^{2} }}} \right) $$
(35)

(Lord and Novick 1968). It is also known that the element \(\rho_{gX}\) = PMC in Eq. (35) always underestimates the true association between the item and the score because of restriction of range (RR) and several other sources of SME (Metsämuuronen 2021) and, hence, the magnitude of the estimate of reliability is reduced for mechanical reasons. If we use G instead of PMC in the formula, we get one option for an “SME-corrected” estimate of reliability:

$$ \alpha_{G} = \frac{k}{k - 1}\left( {1 - \frac{{\sum\limits_{g = 1}^{k} {\sigma_{g}^{2} } }}{{\left( {\sum\limits_{g = 1}^{k} {\sigma_{g} G_{gX} } } \right)^{2} }}} \right) $$
(36)

(Metsämuuronen 2021), which suffers remarkably less loss of information related to SME than the original estimator because G is less affected by SME (Metsämuuronen 2020c, 2021). The matter is not elaborated further here; more “SME-corrected” formulae of reliability can be found in Metsämuuronen (2021) and, specifically, in Metsämuuronen (2020c). More studies in this area would enrich our knowledge of the matter.
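The structural similarity of Eqs. (35) and (36) can be illustrated with a short sketch in which the same function is fed either item–total PMCs or item–total G values. The function name `alpha_from_item_total` and all numerical inputs are hypothetical (assumptions for illustration only); they are not estimates from the article's datasets.

```python
def alpha_from_item_total(item_sds, item_score_assoc):
    """Alpha in the form of Eqs. (35)/(36).

    item_sds: standard deviations sigma_g of the k items.
    item_score_assoc: item-score associations, rho_gX for Eq. (35)
    or G_gX for Eq. (36).
    """
    k = len(item_sds)
    sum_of_variances = sum(s * s for s in item_sds)
    weighted_sum = sum(s * a for s, a in zip(item_sds, item_score_assoc))
    return k / (k - 1) * (1 - sum_of_variances / weighted_sum ** 2)


# Hypothetical inputs for three items (assumed values, not real data).
sds = [0.5, 0.5, 0.5]
pmc_values = [0.6, 0.6, 0.6]     # item-total PMC, Eq. (35)
gamma_values = [0.9, 0.9, 0.9]   # item-total G, Eq. (36)

alpha_pmc = alpha_from_item_total(sds, pmc_values)
alpha_g = alpha_from_item_total(sds, gamma_values)
print(round(alpha_pmc, 3), round(alpha_g, 3))
```

With identical item standard deviations, the estimate grows with the item–score associations, so replacing PMC with a larger G yields a larger reliability estimate in this form.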

Another possibility offered by the directionality of G is its potential use as an indicator of explaining power in the form of \(G^{2}\), in the same manner as we use \(\rho_{XY}^{2}\) for two metric variables and \(\eta^{2}\) for a categorical and a metric variable. The advantage of \(G^{2}\) is that, when the scales of the variables differ from each other, unlike \(\rho_{XY}^{2}\) and \(\eta^{2}\), \(G^{2}\) can also correctly reach the extreme value +1. Presumably, using \(G^{2}\) would give us a kind of “SME-corrected” estimate of the explaining power, indicating how well the variable with a wider scale explains the response pattern in the variable with a narrower scale. This would be useful, specifically, in measurement modelling settings with binary items, where \(\rho_{XY}^{2}\) and \(\eta^{2}\) may underestimate the association remarkably. This area would be worth studying more.

General advantages of G in measurement modelling settings were already discussed in the Introduction (see also Metsämuuronen 2021).

5.3 Limitations

An obvious limitation of the study is that the survey with real-world items, which was used to illustrate the connection between G and the variants of D, carries its own limitations. Although the numbers of subtests (n = 1,292) and items (k = 13,392) used in the survey are rather convincing, they are based on one basic dataset. Results may have been somewhat different if truly polytomous test items had been used in the survey. Replications of the design, or another approach with more independent estimates, may increase our knowledge of the relation between the estimators. Another obvious limitation is that the equality of the obviously different forms of \({\text{ASE}}_{1} (G)\) and \({\text{ASE}}_{1} \left( {D\left( {g\left| X \right.} \right)} \right)\) was noted in the empirical dataset but was not shown algebraically.

When it comes to G itself, because it carries largely the same characteristics as D, G also has some of the known disadvantages noticed in D. In item analysis settings, one of these is that G tends to underestimate the association of g and X in an obvious manner when the number of categories in g is large, that is, more than four (Metsämuuronen 2021). Because G gives obvious underestimates of association in comparison with PMC in these cases, it may be appropriate to propose a correction that protects G against this deficiency. The possible correction needed in G in item analysis settings, where the score and the items are manifestations of the same latent variable and when we have a mechanical correction between these variables, is, undoubtedly, different from that in the case of two independent variables. One option suggested by Metsämuuronen (2021) is a “dimension-corrected gamma” (G2), specific to measurement modelling settings, which transforms the linear nature of G toward a trigonometric nature. G2 seems to overcome the problem of obvious underestimation in item analysis settings without producing obvious overestimates (Metsämuuronen 2021). Studying these kinds of coefficients may enrich the discussion related to “SME-corrected” reliability (see above).

5.4 Further suggestions

Because G appears to be a directional measure, the developers of enhanced or corrected versions of G (e.g., Bai and Wei 2009; Higham and Higham 2019; Hryniewicz 2006; Kvålseth 2017; Masson and Rotello 2009; Rousson 2007) or of enhanced procedures to estimate the confidence intervals for G (e.g., Van der Ark and Van Aert 2015; Woods 2007) may wish to consider, if needed, their correction factors or estimators from this viewpoint as well. Researchers working with D in connection with Harrell’s C and the related AUC and ROC (see Harrell 2001; Harrell et al. 1982; Heagerty and Zheng 2005) may be interested in studying further the possibilities of G in relation to those tools (see Heagerty and Zheng 2005; Higham and Higham 2019) from the directionality viewpoint. It would also be advisable to reconsider the texts in textbooks and manuals that treat G as a symmetric measure (e.g., IBM 2017; Metsämuuronen 2017; Sheskin 2011; Sirkin 2006; Wholey et al. 2015).

All in all, the directional nature of G and D may be worth considering within measurement modelling settings. A relevant question arising from the directionality embedded in G and D is why we would use, in the first place, nondirectional correlation coefficients when the philosophy of measurement modelling is based on the idea of directionality: the latent trait manifested as the score drives the observed behaviour in the item and not the other way round. Studying the family of directional coefficients of correlation could therefore enrich the discussions related to such areas as the estimators of reliability or item discrimination power, or the calculation of the factor loadings used in estimating maximal reliability. G and D, with their directional nature, could be worth considering in these areas.