The need for statistical literacy—described as “the ability to interpret, critically evaluate, and communicate statistical information and messages” (Gal, 2002, p. 1)—is receiving growing worldwide attention. The United Nations Economic Commission for Europe (2009) published A Guide to Improving Statistical Literacy, a manual aimed at promoting statistical literacy among researchers, educators, businesses, policy makers, and the general public, and in 2010 the Royal Statistical Society launched getstats, a 10-year statistical literacy campaign. Despite calls for statistical literacy, most statistical models used today (e.g., correlation and its generalized forms, such as regression) in health, business, education, and behavioral research are difficult for many people to understand (e.g., Brooks, Dalal, & Nolan, 2014; Dunlap, 1994). Lack of statistical literacy is partly responsible. But the complexity of statistical models and the statistical jargon used by researchers must also bear some of the blame.

Correlational analysis is one of the most commonly used statistical techniques, and it exemplifies the latter problem. Although it is arguably among the least complex statistical models used by researchers, correlation is not conducive to intuitive understanding. Consider, for example, a correlational analysis that shows a significant positive correlation between physical exercise and mortality, with a correlation estimate of r = .60. The squared value, r2 = .36, can be interpreted to mean that 36% of the variance in mortality can be explained by variance in physical activity. This description does not facilitate an intuitive mental picture of the important relationship between physical activity and mortality. The underlying statistical jargon, “proportion of variance explained,” is not easily understood by members of the public, policy makers, and public officials (May, 2004). Even some researchers in psychology may not be comfortable with this kind of statistical terminology (Brooks et al., 2014).

In addition, correlational analysis is problematic under many common conditions. Although the most commonly used method for estimating correlation, Pearson’s r, is well-defined whenever the means and variances of X and Y are well-defined (e.g., Hogg & Craig, 1971), drawing a valid inference from the interpretation of r becomes less likely when the conditions of linearity, bivariate normality, and no outliers are violated (Onwuegbuzie & Daniel, 2002). Such violations are common in many research domains (e.g., engineering psychology: Bradley, 1982; educational and clinical psychology: Micceri, 1989), making correlational analysis, and parametric statistics in general, insufficient in many cases. Consequently, there have been increasing calls for the use of nonparametric statistics (Leech & Onwuegbuzie, 2002), going at least as far back as Siegel’s (1956, p. vii) observation that “non parametric techniques of hypothesis testing are uniquely suited to the data of the behavioral sciences.”

Researchers recognize multiple ways in which two continuous variables can be meaningfully related. Two general forms of relation are commonly evaluated: linearity and monotonicity. By definition, a monotonic relationship between X and Y indicates that the direction of change in Y when X changes is preserved across levels of X. Linear functions are monotonic. Other examples of monotonic functions are power functions with odd exponents and the logarithmic function [e.g., X = Y³, X = Y⁵, or X = ln(Y)]. It is generally suggested (e.g., Howell, 2013; Wilcox, 2012) that researchers use r for detecting linear relationships between X and Y and use nonparametric correlations (e.g., Spearman’s rank correlation, rs; Kendall & Gibbons, 1990) for detecting nonlinear monotonic relationships (see note 1).
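As a concrete illustration of this advice, consider a minimal simulation sketch (the cubic relationship, sample size, and seed are our own assumptions, not from the text): Pearson’s r understates a perfectly monotonic but nonlinear relationship, whereas a rank-based coefficient recovers it fully.

```python
# Illustrative only: Pearson's r vs. a rank-based (Spearman-type)
# coefficient on a monotonic but nonlinear relationship, Y = X^3.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = x ** 3                       # perfectly monotonic, but not linear

r = np.corrcoef(x, y)[0, 1]      # attenuated below 1 by the nonlinearity

rank = lambda a: np.argsort(np.argsort(a))  # ranks (no ties expected here)
rs = np.corrcoef(rank(x), rank(y))[0, 1]    # rank correlation

print(round(r, 2), round(rs, 2))
```

Because Y preserves the ordering of X exactly, the rank correlation equals 1 even though the Pearson estimate falls well below 1.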

Perhaps less well-known are the nonparametric methods available to estimate other forms of bivariate relationships. Generally, methods for detecting nonlinear monotonic bivariate relationships are even more complex and less understandable than correlation. Researchers need statistical methods that will allow them to make valid inferences from their data while still enabling them to communicate their analysis in a manner conducive to knowledge mobilization. What, then, is the solution?

We aim to contribute to such a solution by introducing a new statistic that can be more easily understood than correlation and is robust to common violations of parametric assumptions. Probability of bivariate superiority (PBS) is a nonparametric procedure for directly estimating the probability of superiority in a bivariate relationship between two continuous variables. The resulting effect size, Bp, is a common-language effect size estimate of the probability that a respondent who scores high (or low) on X will also score high (or low) on Y. Common-language effect sizes like Bp are interpreted using language that is more familiar to and more intuitively understood by people without a statistical background (Brooks et al., 2014).

Until now, nonparametric common-language effect sizes (Vargha & Delaney, 2000) have been applied mostly to between-group comparisons. McGraw and Wong (1992) introduced a common-language effect size (CL), interpreted as the probability that a score sampled at random from one group will be larger than a score sampled at random from the other group. Grissom (1994) coined the term probability of superiority to appropriately describe the relationship estimated by CL, and Vargha and Delaney introduced the nonparametric estimator A of this probability. Dunlap (1994) derived the formula to convert r to a common-language effect size (CLr) that describes the probability that when an x score is above (or below) the mean of all X, its paired y score is above (or below) the mean of all Y. With these developments, the common-language probability-of-superiority approach has started to gain traction in psychology and other disciplines (e.g., biology: Ling & Nelson, 2014; education: Huberty & Lowman, 2000).

In this article we introduce Bp, a nonparametric extension of CLr. It is important to note that Bp depends upon neither linearity, as r does, nor monotonicity, as rank correlation does. Probability of superiority in a bivariate relationship can exist and can be appropriately interpreted independent of there being a linear or monotonic relationship between X and Y. Bp does not rely on the parametric assumptions upon which r and CLr depend and that are commonly violated in real-world research. But like CLr, it is an effect size that can be interpreted as a likelihood, making it easier to understand (Brooks et al., 2014).

Like correlation, PBS indicates correspondence between X and Y scores but does not imply causation. PBS does, however, make a practically important relationship between X and Y easier to interpret. For example, rather than trying to comprehend that 36% of the variance in mortality can be explained by variance in physical activity, it is easier to understand that there is a 70% likelihood that seniors who exercise more than 1 hour per day (national average) will live longer than 81 years (national average). Not only is this likelihood description easier for most people to understand, it also communicates concrete guidelines for a recommended course of action (exercise more than 1 hour daily) and why to do it (increased chance of living longer than 81 years). Practically speaking, understanding the nature of the relationship between lifestyle choices and health may increase seniors’ willingness to make beneficial lifestyle changes.

The probability of superiority in bivariate relationships has received little attention, despite calls from researchers in other disciplines (e.g., Nelson, 2006). In the remainder of this article, we begin to fill this gap. In the sections that follow, we (a) discuss how linearity can be estimated by r, monotonicity estimated by nonparametric correlations, and PBS estimated by common-language effect sizes (CLESs); (b) review the common-language effect sizes and Dunlap’s (1994) CLr; (c) explain the copula for the probability of bivariate superiority (γ) that is estimated by CLr when the assumptions for r are met; (d) introduce our common-language effect size, Bp, for estimating a PBS relationship; (e) describe the PBS algorithm we have developed to directly compute Bp as a robust estimator of γ; (f) describe our Monte Carlo experiment and the resulting evidence of the robustness of Bp; and (g) demonstrate the application of PBS using both a simulated dataset and real data. Finally, we offer a general discussion of the implications of PBS for past and future research in the behavioral and social sciences.

Linearity measured by Pearson’s correlation coefficient r

Karl Pearson (1895) presented the algorithm for a ground-breaking statistical concept, the correlation coefficient r, which can be used to assess the degree and direction of linear association between two continuous variables. Since its development, r and its generalized forms (e.g., multiple correlation, R, in regression) have been widely employed in the behavioral and social sciences. According to Rodgers and Nicewander (1988), r (and its related concepts) is undoubtedly one of the most revolutionary mathematical and statistical procedures of the 20th century. Many commonly used statistical models have been developed on the basis of r, including its generalized forms (e.g., multiple regression, structural equation modeling), robust forms (e.g., Spearman’s correlation, Kendall’s tau correlation), and its extended applications (e.g., mediation models, reliability assessment). Formally, r estimates the population linear association ρ between X and Y (Hogg & Craig, 1971), which can be expressed as

$$ \rho =\frac{E\left[\left(X-{\mu}_X\right)\left(Y-{\mu}_Y\right)\right]}{\sigma_X{\sigma}_Y}, $$
(1)

where ρ is the population correlation coefficient, E[(X − μX)(Y − μY)] is the expected value of the product of (X − μX) and (Y − μY), μX is the population mean of X, μY is the population mean of Y, σX is the population SD of X, and σY is the population SD of Y. In practice, the population values in Eq. (1) are replaced with sample values to estimate the sample correlation r. Possible values of r range from – 1 to + 1 (i.e., perfect negative to perfect positive linear correlation).
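The sample version of Eq. 1 can be transcribed directly into code (a minimal sketch; the function name and the toy data are ours, for illustration only):

```python
# Sample analogue of Eq. 1: population moments replaced by sample
# means and (population-form) standard deviations.
import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
print(round(pearson_r(x, y), 3))  # 0.8, identical to np.corrcoef(x, y)[0, 1]
```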

Linearity is the central property of the relationship that r describes. Theoretically speaking, and as is implied in Eq. 1, the correlation coefficient r is well-defined as long as the means and variances of X and Y are well-defined, regardless of their distributional characteristics. However, Hogg and Craig (1971) noted that r proves to be a useful estimator of a linear relationship only for certain kinds of distributions of the two random variables. Figure 1 shows scatterplots of 100 simulated X scores drawn from normal, uniform, positively skewed, and negatively skewed distributions, with 100 simulated Y scores conditional on the X scores, based on the linear correlation value r = .80. Although r can reliably describe how the X and Y scores concentrate along a line, in some scenarios (e.g., positively skewed X) the locations of extreme XY points on the XY plane may not be meaningful: the linear relationship can be overstated when the majority of points are relatively dispersed or unrelated and the apparent line is driven by a few outlying points. Hence, r is more useful when X and Y are symmetrically distributed (e.g., normal, uniform) than when they are asymmetrically distributed, despite the fact that “the formal definition of ρ does not reveal this fact” (Hogg & Craig, 1971, p. 74).

Fig. 1

Scatterplots for linear-based XY space with a normal, a uniform, a positively skewed, and a negatively skewed distribution, when the true correlation coefficient is .80

Furthermore, when the relationship between X and Y is not linear (e.g., quadratic, cubic, quartic, or quintic), correlational analysis can lead researchers to inaccurate inferences. Figure 2 shows 10,000 normal X and normal Y that perfectly follow quadratic, cubic, quartic, and quintic bivariate relationships, respectively. For the quadratic data, r = .009; for cubic, r = .780; for quartic, r = .027; and for quintic, r = .512. If r is found to be statistically significant, a researcher may incorrectly infer that a linear relationship does exist (see note 1 above). If r is found to be not statistically significant, a researcher may incorrectly infer that there is no important relationship between the variables. This could explain why many applied statisticians and methodologists suggest that r should only be used to detect the direction (positive/negative) and magnitude of a linear relationship between X and Y when X and Y form a bivariate normal distribution (e.g., Howell, 2013).
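The quadratic case can be reproduced in a few lines (a sketch under our own simulation assumptions, mirroring the setup described for Fig. 2):

```python
# A perfect quadratic dependence yields r near zero: Y is completely
# determined by X, yet the linear correlation detects almost nothing.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = x ** 2
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))  # near 0 despite perfect (nonlinear) dependence
```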

Fig. 2

Scatterplots and observed correlation coefficients r for quadratic, cubic, quartic, and quintic bivariate X and Y

Limited interpretability

It took the publication of Cohen’s (1988) highly influential text for meaningful interpretation of r to be appreciated in the dissemination of research findings. Cohen’s guidelines used the concepts of proportion of variance explained and coefficient of determination, and he suggested rough categorical guidelines for the treatment of r as an effect size. For example, consider the relationship between academic achievement, measured using college GPA, and students’ motivation. A computed correlation between these two variables, r = .30, can be interpreted by first computing its squared value (r2 = .09). The r2 value can in turn be interpreted to mean that 9% of the variance in GPA can be explained by variance in motivation. In this case, .09 is the coefficient of determination. Despite Cohen’s efforts, r2 (or r) arguably remains one of the most confusing statistical concepts in behavioral and social research.

According to May (2004), three guidelines are essential for better disseminating statistical information: understandability, interpretability, and comparability. Understandability is enhanced when statistics are presented in plain language, without statistical jargon and assumptions. Interpretability requires the metric of a statistic to be familiar and easily understandable by the public. Comparability demands that a statistic be compared directly, without any need for manipulation and modification. Correlation meets this last requirement only, but it is certainly deficient in terms of understandability and interpretability.

In light of the difficulty with understanding and interpreting r, Brooks et al. (2014) conducted two experiments in which they recruited undergraduate students and asked them to rate statistical information on the basis of the three criteria of understandability, usefulness, and effectiveness. The statistical information was presented as (a) proportions of variance explained (or the coefficient of determination; r2), (b) probability-based common-language effect sizes (CL), and (c) tabular binomial effect size displays (BESD). Participants perceived both CL and BESD as significantly more understandable and useful than r2. Referring back to May’s (2004) guidelines, it is easy to appreciate why the proportion of variance explained was not preferred: it is pure statistical jargon that is difficult both to understand and to interpret.

The interpretative challenge is especially problematic when Cohen’s (1988) effect size guidelines are applied. For example, r2 = .09, interpreted as 9% of variance explained, is considered a medium effect size. But a person not fully comfortable with statistical terminology may justifiably conceive of 9% as a very low proportion, making any reference to it as a medium effect size confusing and perhaps even perceived to be misleading. Using a metric with these weaknesses can compromise attempts to disseminate findings. Despite the aforementioned weaknesses, r remains a commonly used measure of correspondence between X and Y scores.

When interpreted strictly as an indicator of directionality of Y to X correspondence, r is less difficult to understand. It is apparent from the sign of r whether Y increases or decreases as X increases. Unfortunately, such simplified interpretation has distinct disadvantages. First, it ignores the magnitude of the relationship, making it difficult to evaluate the importance of the relationship. Second, inferences based on this simplified interpretation remain subject to error when the true relationship is not linear. For example, when X and Y are related by a quadratic function the simplified interpretation of r can lead to the incorrect inference that Y neither increases nor decreases as X increases, when in fact it does both.

Monotonicity measured by nonparametric correlations

Commonly used nonparametric alternatives to r are Spearman’s rho, Kendall’s tau, and robust regression. These alternatives depend on monotonicity, but not linearity, and can still be interpreted as estimates of correspondence. In the case of Spearman’s rho, the coefficient of determination describes the proportion of variance in ranks of Y scores explained by variance in ranks of X scores. Kendall’s tau measures the number of concordant XY pairs relative to the number of discordant pairs. Robust regression provides a robust correlation estimate derived from the slope estimated by fitting a robust regression model between X and Y. The correspondence described by r is between scores of X and Y, whereas the nonparametric alternatives describe correspondence of a different nature, complicating interpretation. Nevertheless, researchers may be tempted to interpret the nonparametric statistics as describing a linear relationship, even if it is not between the scores themselves, an error that could be just as misleading as interpretation of r when a linear relationship does not exist (Wilcox, 2012).
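For reference, the two rank-based coefficients described above can be computed from first principles (a numpy-only sketch that assumes no tied scores; the function names and data are ours):

```python
# Spearman's rho: Pearson's r applied to the ranks of X and Y.
# Kendall's tau: average sign of concordance over all pairs.
import numpy as np
from itertools import combinations

def spearman_rho(x, y):
    rank = lambda a: np.argsort(np.argsort(a))  # ranks, assuming no ties
    return np.corrcoef(rank(x), rank(y))[0, 1]

def kendall_tau(x, y):
    pairs = list(combinations(range(len(x)), 2))
    s = sum(np.sign((x[i] - x[j]) * (y[i] - y[j])) for i, j in pairs)
    return s / len(pairs)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)                      # monotonic but strongly nonlinear
print(spearman_rho(x, y), kendall_tau(x, y))  # both equal 1 (up to rounding)
```

Both coefficients report a perfect monotonic association here, whereas Pearson’s r on the same data would be attenuated by the nonlinearity.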

PBS measured by common-language effect sizes (CLESs)

Parametric CLES

Wolfe and Hogg (1971, p. 30) observed that probability estimates are statistics that “frequently make more sense to the consumers of statistical studies than do the statistics that are now reported in the literature.” Their leading example was the probability that an X score is greater than a Y score. On the basis of this work, McGraw and Wong (1992) proposed that this probability be formalized as a common-language effect size (CL). Let \( {X}_i\sim N\left({\mu}_i,{\sigma}_i^2\right) \), i = 1, 2, be independently and normally distributed random variables that represent responses to two conditions (e.g., treatment and control). McGraw and Wong’s CL is the sample estimator for P(X1 > X2). In equation,

$$ CL=\Phi \left[\left({\overline{X}}_1-{\overline{X}}_2\right)/\sqrt{s_1^2+{s}_2^2}\right], $$
(2)

where \( \left({\overline{X}}_1-{\overline{X}}_2\right) \) is the sample mean difference, \( {s}_i^2 \) is the sample variance for Group i = 1, 2, and Φ is the standard normal distribution function. Simply put, CL describes the estimated probability that a score sampled at random from distribution 1 will be larger than a score sampled at random from distribution 2. For example, there is a 70% likelihood that a randomly selected treatment group participant performs better on a cognitive ability test than a randomly selected control group participant. A CL value of .5 indicates stochastic equivalence between the two distributions. A value of 1 implies perfect stochastic superiority of one distribution over another. Grissom (1994) derived additional techniques to estimate P(X1 > X2) under various conditions and adopted a fitting label to describe this likelihood: probability of superiority.
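Eq. 2 is straightforward to evaluate with only the standard library (a sketch; the example means and variances are hypothetical):

```python
# CL from Eq. 2: Phi[(mean1 - mean2) / sqrt(s1^2 + s2^2)].
import math

def cl(mean1, mean2, var1, var2):
    z = (mean1 - mean2) / math.sqrt(var1 + var2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

print(cl(105, 100, 15**2, 15**2))  # > .5: group 1 stochastically superior
print(cl(100, 100, 15**2, 15**2))  # 0.5: stochastic equivalence
```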

Nonparametric CLES

Vargha and Delaney (2000), expanding on work pioneered by Cliff (1993), later developed a robust estimator (A) of CL. This development enabled application of the useful and understandable common-language probability-of-superiority conceptualization to data that do not meet parametric assumptions:

$$ A=\left\{{\sum}_{i=1}^{n_1}{\sum}_{j=1}^{n_2}\#\left[{y}_1(i)>{y}_2(j)\right]+.5\#\left[{y}_1(i)={y}_2(j)\right]\right\}/{n}_1{n}_2, $$
(3)

where # is the count function, y1(i) and y2(j) are the ith and jth observations of Y in Groups 1 and 2, respectively, and ni is the size for Group i = 1, 2.
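Eq. 3 translates into a few lines of array code (a sketch; the data are hypothetical):

```python
# Vargha and Delaney's A: over all n1 * n2 cross-group pairs, the
# proportion of times a Group 1 score beats a Group 2 score, with
# ties counted as half a "win."
import numpy as np

def vd_a(y1, y2):
    y1 = np.asarray(y1, float)[:, None]   # shape (n1, 1)
    y2 = np.asarray(y2, float)[None, :]   # shape (1, n2)
    wins = (y1 > y2).sum() + 0.5 * (y1 == y2).sum()
    return wins / (y1.size * y2.size)

print(vd_a([3, 4, 5], [1, 2, 3]))  # 8.5 wins over 9 pairs ≈ .944
print(vd_a([1, 2, 3], [1, 2, 3]))  # 0.5: stochastic equivalence
```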

CL and its derivatives are becoming more widely employed (Brooks et al., 2014; Cliff, 1993; Li, 2015; Ruscio, 2008) in behavioral and social sciences research situations involving one nominal variable and one dependent variable. In particular, A has been identified as especially useful because it is robust to violations of parametric conditions assumed in CL (Ruscio, 2008) and it exhibits characteristics May (2004) has identified as important for effective dissemination of statistical information: understandability, interpretability, and comparability.

A CLES for continuous bivariate data

Recognizing that understandability and interpretability were lacking in r, Dunlap (1994) proposed an extension of the common-language conceptualization of effect size to research scenarios involving two linearly related, bivariate normal variables (i.e., the conditions assumed by r). Dunlap’s proposal utilized Sheppard’s theorem (Kendall & Stuart, 1977) to convert r to a common-language effect size estimate,

$$ {CL}_r=\left[{\mathit{\sin}}^{-1}(r)/\pi \right]+.5, $$
(4)

where sin−1 is the inverse sine function and π is a constant (≈ 3.14159; for a mathematical proof, see the Appendix). For example, instead of saying that 16% (r2 = .16) of the variance in sons’ heights is explained by variance in fathers’ heights, one can state that “a father who is above average in height has a 63% likelihood of having a son of above-average height” (Dunlap, 1994, p. 510). Following Grissom’s (1994) lead, we describe this likelihood as the probability of bivariate superiority (PBS) and label the resulting parameter γ. We can formalize Dunlap’s (1994) conception of CLr and of PBS as

$$ \gamma =P\left[\left(Y>\overline{Y}\cap X>\overline{X}\right)\cup \left(Y<\overline{Y}\cap X<\overline{X}\right)\right] $$
(5)

where ∩ refers to the intersection of conditions: γ is the probability that a paired X score and Y score fall on the same side (above or below) of their respective means. In practice, researchers rarely have data for the entire population, and Bp denotes the sample estimator of γ.
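Eq. 4 can be checked numerically (a sketch; recall Dunlap’s height example, in which r2 = .16, i.e., r = .40):

```python
# Dunlap's conversion (Eq. 4): CL_r = arcsin(r) / pi + .5.
import math

def cl_r(r):
    return math.asin(r) / math.pi + 0.5

print(round(cl_r(0.40), 3))  # 0.631: the 63% father-son height example
print(round(cl_r(0.00), 3))  # 0.5: no superiority relationship
print(round(cl_r(1.00), 3))  # 1.0: perfect positive relationship
```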

Understanding a bivariate relationship in terms of γ is conceptually similar to Blomqvist’s (1950) q test of dependence. Whereas γ is concerned with the distribution of XY scores evaluated with reference to the mean values of X and Y, Blomqvist’s procedure plots the bivariate data into four quadrants according to the median values of X and Y. The q test is based on the count of scores in each quadrant and the resulting metric lies between – 1 and + 1, which brings with it interpretive difficulties similar to r. When the condition of bivariate normality is met, the mean and median of X are equal, and the mean and median of Y are equal. Under these conditions CLr is mathematically equivalent to q’ (see the Appendix) but is expressed as a likelihood for greater interpretability.

It is apparent to us that Dunlap (1994) intended his extension of CL to CLr chiefly to improve understandability of linear relationships by describing them in intuitive terms. His work was a worthwhile undertaking and an impressive innovation that opened the door to a new potential understanding of bivariate relationships. CLr is a more understandable way to describe the relationship between X and Y, as a probability of superiority, and it makes both the direction and the magnitude of the relationship comprehensible. But CLr, like q, describes a different relationship than r, and does so in terms that are unrelated to linear correspondence. In fact, reading Dunlap’s description and looking at Eq. 5, it is apparent that γ, and therefore CLr, describes a relationship that is not necessarily linear. Nevertheless, Dunlap’s conversion formula is based on the assumption that X and Y are linearly related, and this limits its usefulness.

Equation 4 implies a dependence between the existence of linearity and the existence of a probability-of-superiority effect. However, this implication may not hold. It is possible for a probability-of-superiority relationship to exist between X and Y in the absence of a linear relationship. Figure 3 depicts two idealized plots of the probability of bivariate superiority. Plot 3.A.vii shows a perfect linear relationship between X and Y (r = 1), which implies that a PBS relationship exists, with Bp = CLr = 1. This example is congruent with Dunlap’s (1994) assertion that linearity is sufficient for a PBS relationship to exist. Plot 3.B.vii depicts a perfect PBS relationship (Bp = 1) in which the underlying relationship between X and Y is not linear. This example shows that a linear relationship is not a necessary condition for a PBS relationship to exist.

Fig. 3

Sample scatterplots for bivariate normal correlations with ρ = .05, .1, .3, .5, .7, .9, and 1 (left) (which, when transformed using Eq. 4, provides CLr = .516, .532, .597, .667, .747, .856, and 1), as well as the equivalent probability-of-bivariate-superiority relationships without the requirement of a linear relationship, with the true PBS values γ = .516, .532, .597, .667, .747, .856, and 1 (right)

Since Dunlap’s (1994) bold advance, there has been no development of CLr as a measure of probability of superiority in bivariate relationships independent from r. It is worth speculating that Dunlap’s conversion formula could be applied to nonparametric correlation coefficients when the parametric assumptions for r are violated, but this possibility has not previously been investigated. We tested the usefulness of such an application in our Monte Carlo experiment. Our central contribution is the introduction of a method to directly estimate PBS in the absence of a linear relationship.

Our proposed Bp: A nonparametric extension of CLr

Taking inspiration from the work of Vargha and Delaney (2000), who developed the robust estimator A for CL, we have developed a robust estimator, Bp, of Dunlap’s (1994) CLr. Bp estimates the magnitude of a PBS relationship between two variables, X and Y, without the restrictions of bivariate normal correlation that are required for the conversion in Eq. 4. Like CLr, Bp is conceptualized as the probability that when an X score is above (or below) the mean of all X scores, its paired Y score is also above (or below) the mean of all Y scores, as formalized in Eq. 5. When X and Y follow a bivariate normal distribution and form a linear relationship, Bp directly estimates CLr without relying on a transformation from r.

It is noteworthy that PBS is a special case under copula theory. A copula is used to measure joint distributions of two or more random variables (e.g., Botev, 2017; Jaworski, Durante, Härdle, & Rychlik, 2010; Nelson, 2006). A review of copula theory as it relates to bivariate relationships is beyond the scope of this article, so we refer the reader to Lai and Balakrishnan (2009, chap. 2) for a review. In lay terms, a copula is a statistical concept that explains how two variables are related to each other. In the bivariate XY case, a copula allows one to separate the joint XY distribution into two sources: the marginal distributions of each variable, and the copula that “glues” these variables together. Linear association is only one of many different types of “glue.” As we noted above, limiting investigation of bivariate relationships to the linear copula restricts the potential to identify other bivariate relationship forms that may be practically and theoretically important. Researchers need accessible methods to examine relationships described by a wide array of copulas, including PBS.

Applying copula theory to Blomqvist’s (1950) quadrant approach, we can define how the distribution of Y is glued to the distribution of X. Replacing Blomqvist’s medians with means in our division of quadrants, the copula determines the probability that an XY point falls within a particular quadrant, given a value of γ. Let Xi follow any probability distribution (e.g., normal, lognormal, uniform). There exists a marginal probability distribution for Yi that is generated from the following function:

$$ {\displaystyle \begin{array}{l}{Y}_i\sim U\left({\mu}_Y,c\right),\mathrm{if}\;{X}_i>{\mu}_X\;\mathrm{and}\;\varrho \le \gamma, \\ {}{Y}_i\sim U\left(-c,{\mu}_Y\right),\mathrm{if}\;{X}_i>{\mu}_X\;\mathrm{and}\;\varrho >\gamma, \\ {}{Y}_i\sim U\left(-c,{\mu}_Y\right),\mathrm{if}\;{X}_i<{\mu}_X\;\mathrm{and}\;\varrho \le \gamma, \\ {}\ {Y}_i\sim U\left({\mu}_Y,c\right),\mathrm{if}\;{X}_i<{\mu}_X\;\mathrm{and}\;\varrho >\gamma, \\ {}{Y}_i={\mu}_Y,\mathrm{if}\;{X}_i={\mu}_X,\end{array}} $$
(6)

where c is the outer limit of the uniform distributions, μX is the population mean of X, μY is the population mean of Y, ϱ ~ U(0, 1) follows a uniform distribution with min = 0 and max = 1, and γ is the population PBS that relates X and Y. The generated ϱ values control the likelihood that a simulated Y score is above (or below) the mean of Y when a simulated X score is above (or below) the mean of X, so that Y is related to X at a particular value of γ.
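The generation scheme can be sketched as follows (a minimal simulation that assumes μX = μY = 0, c = 1, and a standard normal marginal for X; all names and settings are ours):

```python
# Simulate a PBS-based relationship: each Y is drawn above or below
# mu_Y so that it agrees in sign with its X with probability gamma.
import numpy as np

def generate_pbs(n, gamma, c=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                    # marginal of X (mu_X = 0)
    agree = rng.uniform(size=n) <= gamma      # the rho ~ U(0, 1) draws
    above = (x > 0) == agree                  # Y above mu_Y iff signs should agree
    y = np.where(above, rng.uniform(0, c, n), rng.uniform(-c, 0, n))
    return x, y

x, y = generate_pbs(100_000, gamma=0.70)
print(round(np.mean(np.sign(x) == np.sign(y)), 2))  # empirical agreement near .70
```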

When the condition of bivariate normality is met, a correlational estimate (r) can be translated into the more understandable PBS using Dunlap’s (1994) common-language correlation transformation (CLr). To continue the example from our introduction, if data for physical activity and mortality meet the bivariate normality condition, then an estimated r of .60 can be converted to the PBS estimate of .705. Below we describe the algorithm by which this estimate can be computed directly, without reference to r. This algorithm makes it possible to compute the Bp estimate and detect PBS-based relationships even when conversion via Eq. 4 is inappropriate, such as when bivariate normality is violated. Thus, Bp provides a robust estimator of the PBS-based relationship in the population, which we label γ.

Figure 3 shows scatterplots for X and Y when (A) X and Y are generated from the conventional condition of linearly related X and Y that forms the bivariate normal correlation (such that r can be mathematically linked to CLr in Eq. 4, which estimates the probability in Eq. 5), and (B) X and Y are directly generated from the PBS function (i.e., Eq. 6) without the unnecessary condition of linearity. In other words, the plots in column B show the relationship between X and Y is based only on a level of γ, and the plots in column A demonstrate the same type of PBS-based relationship between X and Y when the condition of linearity is also met. Whereas Dunlap’s (1994) CLr can accurately detect PBS if X and Y are linearly related and follow bivariate normal distributions, PBS-based relationships can exist when these conditions are not met.

As was noted by Reshef et al. (2011), a good statistic for measuring dependence should possess two heuristic properties—generality and equitability. Generality means that a statistic should detect and measure a wide range of possible associations between X and Y, not limited to specific relationships (e.g., linear). Equitability means that a statistic should give similar values “to equally noisy relationships of any types” (p. 1518). In light of these properties, we have developed a statistic (Bp; Eq. 7 below) for estimating the population PBS (γ) such that it can detect and estimate PBS whether or not a linear XY relationship exists.

First, we need an algorithm that counts the number of times that Y is above (or below) the mean of Y when the corresponding X score is above (or below) the mean of X. The count function, \( \#\left[\operatorname{sign}\kern0.15em \left({x}_i-\overline{x}\right)\cdot \operatorname{sign}\kern0.15em \left({y}_i-\overline{y}\right)>0\right] \), serves this purpose. Second, there is a scaling algorithm, \( 0.5\#\left[\operatorname{sign}\kern0.15em \left({x}_i-\overline{x}\right)\cdot \operatorname{sign}\kern0.15em \left({y}_i-\overline{y}\right)=0\right] \): when \( \left({x}_i-\overline{x}\right) \) and/or \( \left({y}_i-\overline{y}\right) \) equals 0, a .5 unit is assigned to the Bp calculation. This scaling ensures that when there is zero PBS relationship across all the X and Y scores, the summed count becomes half of the sample size (n), and hence the Bp score (Eq. 7) is scaled to .50, as in the calculation of the A statistic in Eq. 3.

Given these considerations, we derived

$$ {B}_P=\frac{\sum_{i=1}^n\#\left[\operatorname{sign}\kern0.15em \left({x}_i-\overline{x}\right)\cdot \operatorname{sign}\kern0.15em \left({y}_i-\overline{y}\right)>0\right]+0.5\#\left[\operatorname{sign}\kern0.15em \left({x}_i-\overline{x}\right)\cdot \operatorname{sign}\kern0.15em \left({y}_i-\overline{y}\right)=0\right]}{n}, $$
(7)

where n is the number of paired observations, # is the count function, xi and yi are the scores or observations from an XY pair in a sample, \( \overline{x} \) is the sample mean of X, and \( \overline{y} \) is the sample mean of Y.
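Eq. 7 is straightforward to implement. The simulation code accompanying the paper is written in R; the following Python sketch (the function name b_p is ours) mirrors the count and scaling algorithms, using the fact that sign(a)·sign(b) > 0 exactly when a·b > 0:

```python
from statistics import mean

def b_p(x, y):
    # B_P (Eq. 7): the proportion of (x_i, y_i) pairs falling on the same
    # side of their respective sample means, with a pair counted as half
    # a unit when x_i and/or y_i sits exactly on its mean.
    xbar, ybar = mean(x), mean(y)
    count = 0.0
    for xi, yi in zip(x, y):
        s = (xi - xbar) * (yi - ybar)
        if s > 0:          # same side of both means
            count += 1.0
        elif s == 0:       # x_i and/or y_i equals its mean
            count += 0.5
    return count / len(x)
```

With no PBS relationship, roughly half of the pairs land on each side and b_p approaches .50, matching the scaling described above.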

We expect that BP gives similar PBS scores for both the linear-based and PBS-based XY planes in Fig. 3, thereby meeting the properties of generality and equitability for a good statistic (for details about the mathematical relationship between Bp and CLr, please see the Appendix). We conducted a Monte Carlo experiment to evaluate the behavior of BP and compared the performance of this new statistic to that of Dunlap’s conversion formula applied to r and its nonparametric counterparts.

It is noteworthy that, even though BP appears to be and is positioned as a more general, interpretable, and robust statistic for measuring PBS-based bivariate relationships, it is not the most powerful statistic for detecting monotonic or linear relationships. Nor is BP an alternative to the correlation coefficient r and nonparametric correlations (e.g., rank correlation) when a linear or monotonic relationship is hypothesized. In other words, the correlation coefficient r works best when estimating a linear relationship, Y = aX + b, and nonparametric correlations (e.g., rank correlation) are expected to work well when estimating linearity and monotonicity. In the particular case in which the underlying relationship between X and Y is monotonic, the nonparametric correlations (e.g., rank correlation) will generally indicate a stronger relationship than BP, because they use more information about the data than just the relationship to the mean. However, the rank correlation may fail when the relationship is no longer monotonic, as in data following Eq. 6. Specifically, given that Eq. 6 sets up the data to minimize any relationship between X and Y beyond the PBS relationship, X and Y are essentially unrelated within a quadrant and, in particular, are not monotonically related within a quadrant. In short, despite previous research findings about the improved interpretability of PBS-related statistics, such as CLr, researchers should consider the type of bivariate relationship (i.e., linearity, monotonicity, or PBS) they are focusing on and choose the corresponding statistic (e.g., r, rank correlation, or BP).

Design of the Monte Carlo experiment

Comparative estimates

Numerous robust estimators have been proposed for detecting XY associations when the parametric assumptions for r have been violated. Three common robust correlations, Spearman’s rank correlation (rs; Kendall & Gibbons, 1990), Kendall’s tau correlation (rt; Kendall, 1938), and robust regression correlation (rr; Wilcox, 2012), can be converted to a CLr value using Eq. 4. For comparative purposes, this study examines the performance of these converted robust correlation estimators, as well as the original Pearson-based CLr as defined by Dunlap (1994), as estimators of γ.

Spearman-based estimation (CLS)

Spearman’s correlation rs converts the X and Y scores into rank scores and then applies Pearson’s product-moment correlation formula to these ranks to summarize the overall relationship between the ranks of the X and Y scores. The resulting value can be plugged into Eq. 4 to obtain an estimate of γ,

$$ {CL}_S=\left[{\mathit{\sin}}^{-1}\left({r}_s\right)/\pi \right]+.5=\left\{{\mathit{\sin}}^{-1}\left\{1-6{\sum}_{i=1}^n{d}_i^2/\left[n\left({n}^2-1\right)\right]\right\}/\pi \right\}+.5, $$
(8)

where \( {d}_i^2={\left[\operatorname{rank}\left({x}_i\right)-\operatorname{rank}\left({y}_i\right)\right]}^2 \) is the squared difference between the rank of a score in X and the rank of its paired score in Y.
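As an illustration, CLS can be coded directly from the squared rank differences. This Python sketch (the helper names ranks and cl_s are ours, and it assumes no tied scores) uses the standard n(n² − 1) denominator for rs:

```python
import math

def ranks(v):
    # Simple 1-based ranks; this sketch assumes no tied values.
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def cl_s(x, y):
    # Spearman-based estimate of gamma (Eq. 8): r_s from the squared rank
    # differences, pushed through the arcsine transformation.
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    r_s = 1 - 6 * d2 / (n * (n ** 2 - 1))   # standard Spearman formula
    return math.asin(r_s) / math.pi + 0.5
```

Because cl_s depends only on ranks, it is unaffected by monotone transformations of either variable.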

Kendall-based estimation (CLT)

Kendall’s tau (rt) measures the strength of association between X and Y and can be plugged into Eq. 4 to obtain an estimate of γ,

$$ {CL}_T=\left[{\sin}^{-1}\left({r}_t\right)/\pi \right]+.5=\left\{{\sin}^{-1}\left\{\left({n}_C-{n}_D\right)/\left[0.5\cdot n\left(n-1\right)\right]\right\}/\pi \right\}+.5, $$
(9)

where nC refers to the number of concordant pairs for X and Y, and nD refers to the number of discordant pairs for X and Y.

Regression-based estimation (CLL)

To obtain a robust correlation (rr), one can fit a regression line that regresses the standardized Y scores (ZY) on the standardized X scores (ZX) based on a robust estimation procedure (e.g., M-estimation; Wilcox, 2012)—that is,

$$ {Z}_Y={\beta}_0+{\beta}_1{Z}_X+e, $$
(10)

where the standardized slope β1 denotes a robust correlation between X and Y that can be plugged into Eq. 4 to obtain an estimate of γ,

$$ {CL}_L=\left[{\mathit{\sin}}^{-1}\left({r}_r\right)/\pi \right]+.5=\left[{\mathit{\sin}}^{-1}\left({\beta}_1\right)/\pi \right]+.5. $$
(11)

We hereafter refer to the various estimates of γ collectively as PBS = (BP, CLr, CLS, CLT, and CLL).

Bootstrap confidence intervals

In addition to the point estimate of γ, its confidence interval (CI) is essential for quantifying the sampling error and making statistical inferences. For example, if a 95% CI for Bp does not span .50, it can be inferred that the PBS estimate is statistically significant at the .05 level. Bootstrapping (Efron & Tibshirani, 1993)—a nonparametric resampling procedure often executed in a computerized statistical package—can often produce trustworthy CIs for statistical measures (Chan & Chan, 2004; Li, Chan, & Cui, 2011). There are three major types of bootstrap CIs: the bootstrap standard interval (BSI), bootstrap percentile interval (BPI), and bootstrap bias-corrected and accelerated percentile interval (BCaI).

Assume one has a dataset with 100 XY paired observations. First, this dataset is resampled with replacement to produce N (e.g., N = 1,000) bootstrap datasets, each containing the same sample size (i.e., 100) as the original dataset. Second, for each of the N = 1,000 bootstrap datasets, the PBS point estimates (denoted as PS in Eqs. 12 to 16) are computed using a statistical package, thereby producing 1,000 bootstrap \( {PS}^{\ast }={B}_P^{\ast }(b) \), \( {CL}_r^{\ast }(b) \), \( {CL}_S^{\ast }(b) \), \( {CL}_T^{\ast }(b) \), and \( {CL}_L^{\ast }(b) \) estimates, where b = 1, 2, . . . , N. Given these bootstrap estimates, the statistical package can construct the 95% BSI,

$$ \mathrm{BSI}=\hat{PS}\pm 1.96\cdot {s}_{PS}^{\ast }, $$
(12)

where \( \hat{PS} \) is the estimated BP, CLr, CLS, CLT, or CLL, respectively, from the original dataset, and \( {s}_{PS}^{\ast } \) refers to the standard error of BP, CLr, CLS, CLT, or CLL, respectively, computed as the standard deviation across the 1,000 bootstrap samples.
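In outline, the BSI procedure looks like this (a Python sketch; the paper’s simulations used the R boot package, and the names bootstrap_bsi and stat are ours):

```python
import random
from statistics import stdev

def bootstrap_bsi(x, y, stat, n_boot=1000, seed=1234):
    # 95% bootstrap standard interval (Eq. 12): the original-sample
    # estimate plus/minus 1.96 times the SD of the bootstrap replicates.
    rng = random.Random(seed)
    n = len(x)
    reps = []
    for _ in range(n_boot):
        # resample XY pairs with replacement, keeping the original n
        idx = [rng.randrange(n) for _ in range(n)]
        reps.append(stat([x[i] for i in idx], [y[i] for i in idx]))
    est = stat(x, y)
    se = stdev(reps)  # bootstrap standard error
    return est - 1.96 * se, est + 1.96 * se
```

Any of the PBS estimators can be passed in as stat; the interval is centered on the original-sample estimate by construction.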

A second method is known as the 95% BPI,

$$ \mathrm{BPI}=\left({PS}^{\ast }(l),{PS}^{\ast }(u)\right), $$
(13)

where l is the 2.5 percentile rank and u is the 97.5 percentile rank of the 1,000 bootstrap \( {PS}^{\ast }={B}_P^{\ast }(b) \), \( {CL}_r^{\ast }(b) \), \( {CL}_S^{\ast }(b) \), \( {CL}_T^{\ast }(b) \), and \( {CL}_L^{\ast }(b) \) estimates, respectively.
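A rough Python sketch of the percentile interval (the name bootstrap_bpi is ours; this version simply indexes the sorted replicates and ignores the interpolation refinements that packages such as R’s boot apply):

```python
def bootstrap_bpi(boot_reps, level=0.95):
    # 95% bootstrap percentile interval (Eq. 13): read the lower and
    # upper percentile positions off the sorted bootstrap estimates.
    s = sorted(boot_reps)
    n = len(s)
    lo = s[int((1 - level) / 2 * n)]
    hi = s[min(n - 1, int((1 + level) / 2 * n))]
    return lo, hi
```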

A third type of bootstrap CI is the 95% BCaI, which adjusts for any skewness in the original dataset,

$$ \mathrm{BCaI}=\left({PS}^{\ast}\left({l}^{\ast}\right),{PS}^{\ast}\left({u}^{\ast}\right)\right), $$
(14)

where l* and u* are lower and upper percentile ranks that are adjusted away from l = 2.5% and u = 97.5% in Eq. 13, depending on the degree of skewness in the original dataset. Two correction factors, i and j, are required to estimate l* and u*. The first factor, i, corrects for the overall bias of the bootstrap PS estimates relative to the estimate obtained from the original dataset. That is,

$$ i={\Phi}^{-1}\left\{\#\left[{PS}^{\ast }(b)< PS\ \right]/N\right\}, $$
(15)

where Φ−1 is the inverse of the normal cumulative distribution function, and #[PS∗(b) < PS] is the count function that counts the number of bootstrap PS∗ estimates smaller than the PS estimate from the original dataset. The second factor, j, adjusts for the rate of change of the standard error of PS with respect to its true parameter value,

$$ j={\sum}_{k=1}^n{\left[ PS(.)- PS(k)\right]}^3/\left(6{\left\{{\sum}_{k=1}^n{\left[ PS(.)- PS(k)\right]}^2\right\}}^{\raisebox{1ex}{$3$}\!\left/ \!\raisebox{-1ex}{$2$}\right.}\right), $$
(16)

where PS(k) is the jackknife value of PS obtained by removing the kth row of the original dataset, and PS(.) is the mean of the n jackknife estimates. Consequently, l* = N ∙ α1, where \( {\alpha}_1=\Phi \left\{i+\frac{i+{z}_{1-\left(\alpha /2\right)}}{1-j\left[i+{z}_{1-\left(\alpha /2\right)}\right]}\right\} \), and u* = N ∙ α2, where \( {\alpha}_2=\Phi \left\{i+\frac{i-{z}_{1-\left(\alpha /2\right)}}{1-j\left[i-{z}_{1-\left(\alpha /2\right)}\right]}\right\} \) (for details, please see Canty & Ripley, 2016; Chan & Chan, 2004; Efron & Tibshirani, 1993; Li et al., 2011).
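The two correction factors and the adjusted proportions α1 and α2 can be sketched as follows (Python; the names bca_factors, stat, and boot_reps are ours, and the function returns the α proportions rather than the ranks l* and u*, which follow by multiplying by N):

```python
from statistics import NormalDist, mean

def bca_factors(x, y, stat, boot_reps, level=0.95):
    # Bias-correction factor i (Eq. 15) and acceleration factor j (Eq. 16),
    # combined into the adjusted percentile proportions alpha_1 and alpha_2.
    nd = NormalDist()
    ps = stat(x, y)
    # i: share of bootstrap estimates below the original estimate,
    # mapped through the inverse normal CDF
    i = nd.inv_cdf(sum(1 for b in boot_reps if b < ps) / len(boot_reps))
    # j: skewness of the jackknife (leave-one-out) estimates
    jack = [stat(x[:k] + x[k + 1:], y[:k] + y[k + 1:]) for k in range(len(x))]
    jm = mean(jack)
    num = sum((jm - v) ** 3 for v in jack)
    den = 6 * sum((jm - v) ** 2 for v in jack) ** 1.5
    j = num / den if den else 0.0
    z = nd.inv_cdf(1 - (1 - level) / 2)  # z_{1-(alpha/2)}, ~1.96 for 95%
    alpha_1 = nd.cdf(i + (i + z) / (1 - j * (i + z)))
    alpha_2 = nd.cdf(i + (i - z) / (1 - j * (i - z)))
    return alpha_1, alpha_2
```

When the bootstrap distribution is unbiased and symmetric (i = j = 0), the proportions reduce to the ordinary percentile values of .975 and .025.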

Simulation conditions

Our experiment investigated the performance of the PBS estimators under three bivariate relationship conditions: (1) linearly related X and Y that follow a bivariate normal correlation, (2) linearly related X and Y that follow nonnormal correlations (i.e., positively skewed, negatively skewed, and uniform), and (3) PBS-related X and Y that follow normal, uniform, positively skewed, and negatively skewed distributions. For each of these relationships, seven levels of true effect size (γ) and four levels of sample size were evaluated.

Population effect size (seven levels)

Seven effect size levels were evaluated: γ = .50, .55, .60, .65, .70, .75, and .80. When assumptions for r are met, these values can be converted using Eq. 4 to the ρ values 0, .156, .309, .454, .588, .707, and .809, providing a comprehensive span of effect sizes [zero, small (.1), medium (.3), large (.5), and extremely large (.8); Cohen, 1988].

Sample size (four levels)

Four levels, n = 20, 60, 100, and 300, were evaluated to represent a range of common sample sizes in behavioral and social science research.

Bivariate Type 1 distributions (four levels)

Type 1 is separated into Type 1a (bivariate normal and linear) and Type 1b (bivariate nonnormal and linear). Type 1a data meet both the PBS-based condition for Bp and the additional linear condition for r and CLr. In other words, Type 1a meets the parametric assumptions for both r and PBS. Type 1b data are uniform, positively skewed, or negatively skewed; they meet the linear condition for X and Y but do not necessarily meet the PBS condition. The X scores were simulated to deviate from normality, and the Y scores were simulated conditional on the X scores for a given level of ρ. We expected that Bp and CLr would behave differently, because the necessary condition (bivariate normal correlation) that links Bp to CLr was violated.

Bivariate Type 2 distributions (4 × 3 levels)

The aforementioned distributions do not allow for a full demonstration of potential PBS-based relationships, as they are based on the widely employed concept of linearly related X and Y. An important gap in previous research is that PBS-based relationships that are free of linearity have been ignored. These bivariate relationships can be directly derived from the function in Eq. 6. Given this function, X can be generated from any distribution (e.g., normal, uniform, positively skewed, negatively skewed), and Y can be generated from Eq. 6 with a manipulated level of γ. In this study, three levels of c in Eq. 6 (\( \sqrt{3}/2 \), \( \sqrt{12}/2 \), and \( \sqrt{48}/2 \)) were examined, which produced three types of uniformly distributed Y scores with SDs = 0.25, 1, and 4, respectively smaller than, identical to, and larger than the SD (i.e., 1) of the X scores. For SDs of Y equal to either .25 or 4, the generated Y scores appear to contain outliers. These distributions reflect scenarios in which some variables contain larger variance and have longer tails than others.

This experiment was designed with 7 × 4 × 4 = 112 simulation conditions that meet the linear condition and 7 × 4 × 4 × 3 = 336 simulation conditions that meet only the PBS conditions, for a total of 448 simulation conditions. Each condition was replicated 1,000 times, and for each replication the dataset was bootstrapped 1,000 times to generate the three bootstrap CIs. This produced a total of 448 (conditions) × 1,000 (replications) × 1,000 (bootstraps) = 448,000,000 simulated datasets.

Procedure

Type 1

To generate the simulation data, first, X scores were generated from a normal distribution, N(0, 1²), which meets the linear condition. Second, X scores were generated from a uniform distribution, U(\( -\sqrt{12}/2 \), \( \sqrt{12}/2 \)), with mean = 0 and SD = 1; this mimics scenarios in which scores are evenly and uniformly distributed in a sample. Third, X scores were generated from a lognormal distribution, lnN(−0.3456, 0.8326²), so that the mean is 1 and the SD is 1. This forms a positively skewed distribution, with skewness = 4 and kurtosis = 38, commonly found in the behavioral and social sciences, for example, in data from biological measures (e.g., Wilcox, Granger, Szanton, & Clark, 2014) and measures of affect and depression levels (e.g., Tomitaka et al., 2016). Fourth, X scores were generated by multiplying by −1 scores Xa generated from lnN(−0.3456, 0.8326²); hence, the generated X scores follow a negatively skewed distribution with mean = −1, SD = 1, skewness = −4, and kurtosis = 38. Given the generated X scores, for Type 1, the linearly related Y scores were generated from

$$ Y=\rho X+{e}_Y, $$
(17)

where ρ is the population Pearson correlation converted from the population PBS (γ) through Eq. 4, X refers to the simulated scores from Type 1 or 2, and eY is an error score generated from a normal distribution with mean = 0 and variance = 1 − ρ². Given this method, X and Y are expected to be linearly correlated at a level of ρ.
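Inverting Eq. 4 gives ρ = sin[π(γ − .5)], so a Type 1a dataset can be generated directly from a target γ. A Python sketch (the study itself used R; the function name generate_type1 is ours):

```python
import math
import random

def generate_type1(n, gamma, seed=1):
    # Type 1a sketch: X ~ N(0, 1) and Y = rho*X + e (Eq. 17), with error
    # variance 1 - rho^2 so that Y is standardized, and rho recovered from
    # the target gamma by inverting the arcsine transformation (Eq. 4).
    rng = random.Random(seed)
    rho = math.sin(math.pi * (gamma - 0.5))  # Eq. 4 inverted
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    e_sd = math.sqrt(1.0 - rho ** 2)
    y = [rho * xi + rng.gauss(0.0, e_sd) for xi in x]
    return x, y
```

For γ = .70 this yields ρ ≈ .588, matching the converted values listed under the effect-size levels.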

For Type 2, according to Eq. 6, ϱ values were first generated from a uniform distribution, U(0, 1), with min = 0 and max = 1. Second, to allow sampling error in each replicated sample, the γ values were generated from a binomial distribution, B(n, γ). Given the generated ϱ and γ for each simulated respondent, and the known X scores, the Y scores were generated either from a uniform distribution [i.e., \( U\left(\overline{X},c\right) \) or \( U\left(-c,\overline{X}\right) \)] or set at the mean of Y, as shown in Eq. 6. Note that to allow for sampling error, the sample mean estimate \( \overline{X} \) was used instead of μX in Eq. 6.

Consequently, for each replication, a dataset was generated containing both the X and Y scores that were used to compute the PBS estimates of γ. This dataset was also used to generate the 95% BSI, BPI, and BCaI (with 1,000 bootstrap replications). Thus, for each condition 1,000 PBS = (BP, CLr, CLS, CLT, and CLL) estimates and their associated 95% BSI, BPI, and BCaI were obtained. The simulation was conducted in R (R Core Team, 2016). Note that the code called two packages, boot (Canty & Ripley, 2016) and MASS (Venables & Ripley, 2002), that executed the bootstrap procedures and performed the robust linear regression, respectively. The simulation code is available in the supplementary materials.

Results

Evaluation criteria

For Distribution Types 1a and 2, two evaluation criteria were used to assess the performance of each of the PBS estimates of γ. First, percentage bias was computed as \( \mathrm{bias}=\left[\left(\overline{PS}-\gamma \right)/\gamma \right] \), where \( \overline{PS} \) is the mean of the 1,000 PBS estimates (expressed as BP, CLr, CLS, CLT, and CLL) in 1,000 simulated samples. A PBS estimate was considered reasonable if the bias was within ± .10 (or 10%; Li et al., 2011). This bias examines the performance of a PBS estimate in only one condition. To evaluate overall performance across all 336 nonparametric conditions, a second criterion was used: the mean absolute percentage bias (MAPE), defined as \( \mathrm{MAPE}=\left({\sum}_{s=1}^{336}\left|\mathrm{bias}(s)\right|\right)/336 \). A MAPE smaller than .10 (or 10%) is considered reasonable (Li et al., 2011). MAPE was also used to separately evaluate the performance of the 28 Type 1a linear and normal conditions, in which \( \mathrm{MAPE}=\left({\sum}_{s=1}^{28}\left|\mathrm{bias}(s)\right|\right)/28 \), and the 84 Type 1b linear and nonnormal conditions, in which \( \mathrm{MAPE}=\left({\sum}_{s=1}^{84}\ \left|\mathrm{bias}(s)\right|\right)/84 \). Regarding the performance of the bootstrap CIs, given that 95% BSI, BPI, and BCaI were constructed, coverage was expected to be 950 out of 1,000 replications [or, expressed as a coverage probability (CP), .95]. But sampling error makes it impractical for researchers to obtain a perfect CP of .95. We considered acceptable an observed CP that falls within the range (.925, .975) (Chan & Chan, 2004).
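The two point-estimate criteria amount to a few lines; a Python sketch (the function names pct_bias and mape are ours):

```python
def pct_bias(estimates, gamma):
    # Percentage bias: (mean of replicated PBS estimates - gamma) / gamma;
    # values within +/- .10 are treated as reasonable.
    return (sum(estimates) / len(estimates) - gamma) / gamma

def mape(biases):
    # Mean absolute percentage bias across simulation conditions;
    # a value under .10 is treated as reasonable.
    return sum(abs(b) for b in biases) / len(biases)
```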

For Distribution Type 1b, the purpose is to evaluate the difference between Bp and CLr (i.e., d = Bp − CLr) when XY points were generated from nonnormal distributions (i.e., positively skewed, negatively skewed, uniform) with an associated true correlation value (ρ). Given that X was generated from a nonnormal distribution, and Y was generated from a linear model (Eq. 17) conditional on a true correlation value (ρ) related to X, the equivalence Bp ≡ CLr [i.e., \( {B}_P=\frac{\sum \limits_{i=1}^n\#\left[\operatorname{sign}\kern0.15em \left({x}_i-\overline{x}\right)\cdot \operatorname{sign}\kern0.15em \left({y}_i-\overline{y}\right)>0\right]+0.5\#\left[\operatorname{sign}\kern0.15em \left({x}_i-\overline{x}\right)\cdot \operatorname{sign}\kern0.15em \left({y}_i-\overline{y}\right)=0\right]}{n}\equiv \frac{1}{\pi }{\sin}^{-1}(r)+0.5={CL}_R; \) Eq. 8 in the Appendix] becomes invalid. Hence, even though we know the true ρ value in our simulation, the associated true PBS value, \( \gamma =P\left(Y>\overline{Y\ }\right|\ X>\overline{X}\Big) \), is unknown and is unique for every type of nonnormal distribution of X. As a result, we cannot accurately evaluate biases and coverage probabilities when γ is unknown.

Performance under linear conditions (Table 1)

Type 1a

As expected, across the 28 conditions in which linear and normal conditions were met, the performances of BP and CLr were highly comparable. The biases of BP ranged from – .005 to .008, with mean .000 and SD .003, indicating an excellent fit. Of the 28 conditions, all biases fell within the criterion of ± .10 (or 10%), and MAPE was .002. For CLr, the biases ranged from – .007 to .004, with mean .000 and SD .002, showing an excellent fit. All the biases were within the criterion of ± .10 (or 10%), and MAPE was .001. Regarding the bootstrap CIs, the mean of the 28 CPs yielded by BSI, BPI, and BCaI for BP were .966, .978, and .913, respectively, which are comparable to those obtained by BSI, BPI, and BCaI for CLr, i.e., .932, .937, and .942, respectively. Thus, both the new PBS procedure (producing the BP estimate) and the traditional CLr are trustworthy and appropriate when the parametric assumptions are met: Both methods produced similar results with minimal bias.

Table 1 Coverage probabilities (CP) and percentage biases of the point estimates of BP and CLr when the linear and normal conditions are met (Type 1)

Type 1b

As is shown in Fig. 4, BP consistently produced a larger estimate than CLr under a uniform distribution: The difference (d) values ranged from .000 to .034, with a mean of .015. For positively and negatively skewed distributions, BP consistently produced a smaller estimate than CLr: Here the d values ranged from – .046 to .001, with a mean of – .022. These results indicate that BP and CLr are different—that is, \( {B}_P=\frac{\sum \limits_{i=1}^n\#\left[\operatorname{sign}\kern0.15em \left({x}_i-\overline{x}\right)\cdot \operatorname{sign}\kern0.15em \left({y}_i-\overline{y}\right)>0\right]+0.5\#\left[\operatorname{sign}\kern0.15em \left({x}_i-\overline{x}\right)\cdot \operatorname{sign}\kern0.15em \left({y}_i-\overline{y}\right)=0\right]}{n}\ne \frac{1}{\pi }{\sin}^{-1}(r)+0.5={CL}_R \), when the linear and nonnormal conditions are met.

Fig. 4

Means of 1,000 replicated estimates for BP minus CLr (i.e., d scores) across the 84 conditions in which linear and nonnormal conditions are met (Type 1b).

Performance under PBS conditions in the absence of linearity

Point estimates

Among the five PBS estimates, BP performed best and was the only estimate considered reasonable across both evaluation criteria, as is shown in Fig. 5. Across the 336 conditions in which parametric assumptions were violated, the biases ranged from – .059 to .057, with mean – .018 and SD .017, indicating a good estimate. All 336 conditions produced a BP estimate within the criterion of ± .10 (or 10%), showing excellent fit. Overall, the MAPE (.019) was also well within the criterion of ± .10 (or 10%), which demonstrates that BP is an appropriate and robust estimator of γ across all the simulation conditions.

Fig. 5

Biases for BP, CLr, CLS, CLT, and CLL when only the PBS condition is met (Type 2)

The performance of the remaining four PBS estimates was less than optimal. The parametric-based CLr performed poorly and did not reliably detect PBS relationships. The biases ranged from – .234 to .007, with mean – .119 and SD .070. Of the 336 conditions, only 128 (38.1%) yielded a CLr estimate within ± .10. Overall, the MAPE for CLr was outside the criterion (.119), further demonstrating that CLr is not an optimal estimator for γ when X and Y are not linearly related.

Two robust-based PBS estimates performed slightly better than CLr but neither performed adequately. The CLS biases ranged from – .234 to .007, with mean – .113 and SD .070. Of the 336 conditions, 144 (or 42.9%) produced a CLS estimate within ± .10, and the MAPE was .114. The CLL biases ranged from – .210 to .007, with mean – .110 and SD .064. Of the 336 conditions, 139 (or 41.4%) produced a CLL estimate within ± .10, and the MAPE was .110. The last robust estimate, CLT, performed worse than the parametric CLr. These biases ranged from – .267 to .004, with mean – .146 and SD .086. Of the 336 conditions, only 96 (or 28.7%) produced a CLT estimate within ± .10, and the MAPE was .146.

Confidence intervals

As is shown in Fig. 6, two of the three bootstrap CI procedures provided acceptable coverage probabilities (CPs) for BP. The 95% BSI for BP performed best: Across the 336 conditions, the CPs obtained from the 95% BSI ranged from .666 to .986, with mean .954 and SD .043. Of the 336 conditions, 280 (or 83.3%) conditions produced a CP within the criterion of (.925, .975), indicating good fit. The 95% BPI for BP, yielded CPs that ranged from .422 to .998, with mean .934 and SD .084. Of the 336 conditions, 229 (or 68.2%) conditions produced a CP within (.925, .975). However, the 95% BCaI for BP was less than optimal: The CPs ranged from .652 to .964, with mean .901 and SD .041. Of the 336 conditions, 109 (or 32.4%) conditions produced a CP within (.925, .975).

Fig. 6

Coverage probabilities when only the PBS condition is met (Type 2). BSI = bootstrap standard interval, BPI = bootstrap percentile interval, BCaI = bootstrap bias-corrected and accelerated percentile interval

Given that the biases of the point estimates based on CLr, CLS, CLT, and CLL are outside a reasonable range, the associated BSI, BPI, and BCaI values are likewise less than optimal. For CLr, the coverage probabilities yielded from BSI ranged from 0 to .957 with a mean of .443. Of the 336 conditions, only 29 (or 8.6%) fell within the criterion of (.925, .975). For BPI, range = (0, .958), mean = .451, and 38 (11.3%) fell within the criterion. For BCaI, range = (0, .967), mean = .433, and 54 (or 16.1%) fell within the criterion. For CLS, BSI produced a range of (0, .968), mean = .478, and 38 (11.3%) fell within the criterion; BPI yielded a range of (0, .968), mean = .477, and 64 (or 19.0%) fell within the criterion; BCaI resulted in a range of (0, .969), mean = .467, and 64 (or 19.0%) fell within the criterion. For CLT, BSI produced a range of (0, .968), mean = .308, and 54 (16.1%) fell within the criterion; BPI yielded a range of (0, .966), mean = .309, and 54 (or 16.1%) fell within the criterion; BCaI resulted in a range of (0, .974), mean = .304, and 57 (or 17.0%) fell within the criterion. For CLL, BSI produced a range of (0, .961), mean = .495, and 36 (10.7%) fell within the criterion; BPI yielded a range of (0, .959), mean = .503, and 48 (or 14.3%) fell within the criterion; BCaI resulted in a range of (0, .968), mean = .483, and 61 (or 18.2%) fell within the criterion. Because only the point estimates and BSI for BP yielded reasonable results overall, the following discussion of the effects of the manipulated factors focuses only on the point estimates and BSI for BP.

Effects of manipulated factors on B P and BSI

The effects of the manipulated factors on the performance of Bp were minimal, as is shown in Table 2. There were no obvious effects of varying the distribution of Y on Bp and BSI. The most influential factor was the distribution of X (θ): When X was positively or negatively skewed, the point estimate biases were slightly more negative (except when γ = .80, n = 20, and θ = negatively skewed). For example, when γ = .80 and n = 300, the magnitude of the biases increased from – .009 (θ = normal) and – .008 (θ = uniform) to – .055 (θ = positively skewed) and – .054 (θ = negatively skewed). This may be due to the sample mean estimates (\( \overline{x} \) and \( \overline{y} \)) in Eq. 7, which become less robust estimates of the center of the distribution when there is a long tail. Second, when n was increased from 20 to 300, the accuracy of Bp improved, especially when the θ distribution was normal or uniform. This is reasonable, because a good estimator should be asymptotically consistent, meaning that as n → ∞, BP → γ. Third, when γ increased from .50 to .80 and other factors were held constant, the biases of Bp became slightly more negative. This pattern is reasonable because γ has an upper bound of 1, which results in fewer sample Bp estimates above γ = .80 than below it. Overall, the effects of the manipulated factors on Bp were minimal, and hence these results demonstrate that BP is a trustworthy estimator of γ across the conditions examined in this study.

Table 2 Percentage biases of BP and coverage probabilities of BSI in selected conditions

For BSI, the two worst coverage probabilities (.687 and .695) resulted when γ = .80, n = 300, and θ = positively or negatively skewed. This is understandable because BSI depends upon the accuracy of the point estimate of Bp, and under these conditions the least accurate point estimates were found. Also, these conditions are quite strict: a relatively large γ = .80 (capped at 1), a challenging skewed distribution (skewness = 4 or – 4 and kurtosis = 38), and a relatively narrow BSI (because of the large sample size). Hence, a narrow BSI becomes more sensitive to slight deviations of Bp from γ, and this inevitably results in smaller CPs. When the other factors (n, γ, and θ = normal or uniform) were manipulated, the CPs of the BSI were robust across conditions, demonstrating that BSI is a good CI construction procedure for Bp.

Working example

This section illustrates how researchers can obtain the Bp estimate of γ and its bootstrap CIs for their dataset using R (R Core Team, 2016; or RStudio: RStudio Team, 2016), a free and popular statistical package in the behavioral and social sciences. We have made available the code for a function that computes the Bp estimate, together with a sample dataset and step-by-step instructions, in the supplementary materials. First, copy the code (function name: pbs) and execute it in R. Second, enter the X and Y scores supplied in the supplementary materials to form a 100 × 2 data matrix in R. To best demonstrate how the code works, we simulated the X and Y scores so that the population parameter γ is known and can be used to evaluate the accuracy of the sample estimate BP. In this example, the X and Y scores were generated with γ = .60, n = 100, a uniform θ distribution, and \( {\sigma}_Y=\sqrt{12}/2 \). Third, run the syntax pbs(data, 1000, .95, 1234, 4) in R, where data refers to the name of the 100 × 2 data matrix, 1000 is the number of bootstrap samples, .95 specifies a 95% CI, 1234 is the random seed, and 4 is the number of decimal places displayed in the output. If a researcher chooses to use these default settings, the syntax can be further simplified to pbs(data). Alternatively, a researcher could alter any of the arguments to suit a study’s particular needs (e.g., change the confidence interval to 99% by entering .99 in place of .95 in the third argument).

Once R finishes running the code, the results will be displayed (see Step 4 of the Appendix). In this case, BP = .61, with a 95% BSI of (.5031, .7169), which indicates a statistically significant result at the .05 level because the BSI does not span .50. For purposes of interpretation, this BP = .61 estimate tells us that there is a 61% chance that when an X score is above the mean of all X scores, the paired Y score is also above the mean of all Y scores, and this estimate is statistically significant.

For purposes of comparison only (see note 2), computing the correlation for this simulated dataset produces r = .1210, p = .2306, which is nonsignificant at the .05 level. Applying Dunlap’s conversion formula (Eq. 4) to this r yields a CLr estimate of γ of .5386. The CLr estimate is biased, as should be expected, because the underlying X–Y relationship is not linear. Furthermore, a researcher who computes this estimate may mistakenly infer that because the correlation is not statistically significant, there is also no significant PBS relationship between X and Y. However, using BP to estimate γ and the 95% BSI to test the statistical significance of this estimate, we can correctly identify a significant bivariate relationship that would be missed by traditional correlational analysis.

Real-world examples

For purposes of demonstration only, this section presents how researchers could reach different conclusions if they specify a different hypothesis (i.e., linearity vs. PBS) and use a different statistic (i.e., r vs. Bp) to test it. We suggest that identifying PBS relationships in real-world research can contribute valuable information not revealed by traditional correlational analysis. To demonstrate this contribution across different disciplines, we randomly selected two recently uploaded bivariate datasets for analysis from the Ontario Data Documentation, Extraction Service and Infrastructure (ODESI), a Web-based digital repository for social sciences data.

The first dataset (Mychasiuk, 2017) contains behavioral data for 74 adolescent rats with mild traumatic brain injury (RmTBI) after the consumption of caffeine. We obtained r and CLr estimates for the relationship between weight (weightTBI) and the average time to right (AverageTimeToRight). A scatterplot of the data is shown in Fig. 7. The results show r = .158, 95% CI = (–.083, .367), indicating no statistically significant linear relationship between these variables. Using Dunlap’s (1994) transformation, we obtain CLr = .551 as an estimate of γ. However, when directly computing an estimate of γ using our PBS method, we find BP = .581, 95% BSI = (.455, .701). BP estimates a stronger PBS relationship than does CLr, although the 95% BSI still spans .50, the PBS null value.

Fig. 7

Scatterplots for the real-world datasets. The left panel shows a scatterplot of the weight (weightTBI) of 74 adolescent rats against their average time to right (AverageTimeToRight) in the first real-world dataset (Mychasiuk, 2017). The right panel shows a scatterplot of the concentrations of water (H2O) and carbon dioxide (CO2) in the second real-world dataset (Wunch et al., 2017)

There is an important lesson we can take from this result. If we use Eq. 4 to convert our BP estimate to the r-metric, the value is .252, which suggests a noticeably larger correlation than the actual r value obtained in the correlational analysis. This is not to suggest that the correlational analysis provided an underestimate of the linear relationship. To the contrary, we suggest that treating a measure of the PBS relationship as an effect size to describe a correlation is misleading and can result in precisely such incorrect inferences.

Dunlap’s (1994) Eq. 4 implies that a PBS relationship requires the existence of a linear relationship and that the PBS estimate can describe the linear relationship. This is not the case. The present example demonstrates the type of problem that could arise as a result of treating CLr or any other estimate of PBS as an effect size for a linear relationship. Although linearity implies the existence of a PBS relationship, the existence of a PBS relationship does not imply linearity. Being able to directly compute an estimate of the magnitude of a PBS relationship without dependence on correlational analysis should help reduce any confusion about the relationship between these bivariate forms. Our PBS algorithm makes reliance on the CLr conversion formula unnecessary and will allow PBS to be appreciated independently of r.
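Dunlap’s (1994) Eq. 4, \( \frac{1}{\pi}\sin^{-1}(r)+0.5 \), and its inverse are simple enough to state directly. The snippet below (Python, for illustration only) implements the r-to-CLr conversion and the inverse mapping used above to push a BP estimate back onto the r metric; the numbers reproduce the values reported in this section.

```python
import math

def r_to_cl(r):
    """Dunlap's (1994) Eq. 4: convert a Pearson r to the common-language metric CLr."""
    return math.asin(r) / math.pi + 0.5

def cl_to_r(cl):
    """Inverse of Eq. 4: express a PBS-type estimate on the r metric."""
    return math.sin(math.pi * (cl - 0.5))

# Values reported in the text:
#   r = .1210 (working example)  -> CLr = .5386
#   r = .158  (rat dataset)      -> CLr = .551
#   r = -.079 (trailer dataset)  -> CLr = .475
#   BP = .581 (rat dataset)      -> r-metric value of .252
```

Note that the conversion is monotone and invertible, which is precisely why it cannot add information: a nonsignificant r simply becomes a CLr near .50, regardless of whatever nonlinear PBS structure is present in the data.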

The second dataset (Wunch, Arrowsmith, & Heerah, 2017) comes from an environmental study that measured the concentrations of chemicals in a bike cargo trailer 362 times between June 28 and July 19, 2017. We computed r and BP estimates for the relationship between water (H2O) and carbon dioxide (CO2): r = –.079, 95% CI = (–.180, .024), a nonsignificant result. By conversion using Eq. 4, we found CLr = .475, but we obtained a very different result by direct estimation: BP = .583, 95% BSI = (.518, .648). A scatterplot of the data is shown in Fig. 7. Here we have identified a statistically significant PBS effect in the absence of a linear relationship. Relying on correlational analysis, a researcher would conclude that there is no important relationship between H2O and CO2. The use of PBS analysis to obtain BP leads to a very different inference: there is a significant probability-of-bivariate-superiority relationship between H2O and CO2, such that when the H2O level is above the mean H2O level, there is a 58.3% chance that the CO2 level will be above the mean CO2 level. This example illustrates the advantage of conducting PBS analysis when seeking to understand the relationship between two variables. Not only does PBS identify an important relationship when one exists, but that relationship is also easy to communicate, making the dissemination of research findings more impactful.

Conclusions and discussion

In this article we have described a new statistical procedure for estimating the probability of bivariate superiority, an important type of bivariate relationship. Although little previous work has addressed this type of relationship, statisticians have suggested its importance within the framework of copula theory (Jaworski et al., 2010), and it has previously been suggested as a more understandable way to describe a bivariate relationship (Dunlap, 1994). PBS is not simply a concept translated from the bivariate normal correlation (r), as is implied by Dunlap’s CLr. Rather, it is a unique theoretical and statistical model for quantifying bivariate relationships under copula theory. PBS describes how likely it is that an X score that is above (or below) the mean is associated with a Y score that is above (or below) the mean.

In a Monte Carlo experiment, we simulated data from bivariate relationships under a variety of conditions, many of which violated the parametric and linearity assumptions of conventional correlational analysis. We used our new PBS algorithm to compute a point estimate, BP, of the true PBS, γ, and a confidence interval, in each of 448 conditions. The results of the experiment demonstrate that BP can appropriately identify this type of bivariate relationship, is robust to the simulation conditions that deviated from conventional parametric assumptions, and supports inferences about the statistical significance of the bivariate relationship through the use of bootstrap CIs. Moreover, the likelihood-based interpretation of BP is more understandable and interpretable than conventional r-based interpretations of bivariate relationships (i.e., proportion of variance explained).

The different BP and CLr estimates found in Distribution Type 1b have important implications for the use of CLr in practice. These results suggest that CLr estimates the PBS parameter γ differently than BP when X and Y are linearly related but nonnormal. Although we do not know the true population γ value exactly, the BP procedure, which directly counts the number of times that \( Y>\overline{Y} \) and \( X>\overline{X} \) in a sample, appears to better measure, and to be more consistent with, the concept of \( \gamma =P\left(Y>\overline{Y}\cap X>\overline{X}\right) \) that defines the PBS parameter, as compared to CLr, which is based on the r-to-CLr conversion, \( \frac{1}{\pi }{\sin}^{-1}(r)+0.5 \). In short, we can at least safely conclude that CLr is of limited use when a researcher converts an r value to CLr in situations in which X and Y are linearly related but nonnormal. This result also supports Dunlap’s (1994) suggestion that researchers can use the r-to-CLr conversion when X and Y follow a bivariate normal distribution.

We provided a working example and an example from real-world research demonstrating that PBS can identify an important and significant bivariate relationship in the absence of a significant linear correlation. The correlation in the working example was nonsignificant at the .05 level, r = .1210, p = .2306. However, PBS analysis identified a bivariate superiority relationship, BP = .61, p < .05. Critically, the examples we have outlined, together with our Monte Carlo results, imply that CLr is not an adequate procedure for detecting PBS relationships, because (a) a researcher who finds an r that is not statistically significant is unlikely to bother to transform that r to CLr, (b) the transformation to CLr provides no information about the significance of the effect unless an appropriate bootstrap technique is used to construct a confidence interval, and (c) when linearity and parametric assumptions are violated, the transformation from r to CLr leads to biased estimates of γ that could support erroneous inferences. Hence, researchers who have relied upon r or its related models (e.g., the linear regression model) to evaluate bivariate relationships, even those who have transformed their results to CLr, may have missed PBS relationships that are important for theory testing and model building in the behavioral and social sciences.

We have outlined a solution to these problems and have specified a reliable method to directly compute a point estimate of γ that is robust to violations of parametric assumptions and produces results that are more easily communicated for research dissemination. We have proposed PBS as a new statistical tool that can be used in future research to identify the probability of superiority in bivariate relationships. In addition, the effect size estimate produced by PBS analysis, BP, is a common-language effect size that makes communicating the character of a bivariate relationship more effective. Finally, we propose PBS as a statistical analysis that behavioral and social science researchers can apply to past research. We encourage researchers to reexamine bivariate relationships in their datasets to find theoretically important effects that had previously been overlooked.

Limitations and future directions

We were inspired by the ground-breaking work of others (Dunlap, 1994; Grissom, 1994; McGraw & Wong, 1992; Vargha & Delaney, 2000; Wolfe & Hogg, 1971) to develop PBS, and we constructed this new procedure upon the solid foundations they built. However, PBS is a new statistical procedure, and there is much work to be done to provide a more complete picture of PBS and its theoretical and practical applications. The nature of conducting a Monte Carlo experiment is that one must choose from among the many variables that can be manipulated, and choose the levels at which each variable will be tested. Although we are confident that we chose the most important variables for an initial test of our procedure, and levels for each that adequately represent many common data circumstances encountered in real research scenarios, we recognize that there are other important conditions under which PBS should be evaluated. In particular, future research should consider additional nonnormal distributions of X (e.g., bimodal, U-quadratic, normal-ogive, logistic) and how each is “glued” to Y at particular levels of γ.

A priority for advancing PBS, both theoretically and practically, is the development of generalized forms of BP for use in research scenarios involving more than two variables. Behavioral and social science researchers often investigate how an outcome measure (or criterion) can be regressed on multiple predictor variables through regression analysis. Additional research is necessary to examine how the PBS concept can be extended and generalized to more complex research situations of this nature.

Another future direction for the development of PBS involves examination of the diagnostic value of the probability-of-superiority conceptualization. The basic idea of PBS focuses on the likelihood that when an X score (e.g., daily exercise) is above (or below) the mean, the paired Y score (e.g., hospitalization) is also above (or below) the mean. Consequently, researchers may use PBS information to classify or diagnose individual participants into a 2 × 2 profile—that is, daily exercise (good or bad) and chance of hospitalization (high or low). Future research is necessary to investigate the accuracy and usefulness of these types of PBS-implied diagnostic profiles for each individual participant in a research study. We expect this approach will lead to development of a PBS-based method for individualized diagnostic information to be communicated to people in a way that is both understandable and useful to them.
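As a concrete illustration of this idea, the sketch below (Python, for illustration; the quadrant labels are hypothetical placeholders, not terminology from this article) assigns each paired observation to one cell of the 2 × 2 mean-split profile described above.

```python
import numpy as np

def pbs_profile(x, y):
    """Assign each (X, Y) pair to a cell of the 2x2 mean-split profile.
    Labels are illustrative placeholders, e.g., 'X-high/Y-low' could mean
    daily exercise above the mean paired with hospitalization risk below it."""
    x_hi = x > x.mean()
    y_hi = y > y.mean()
    cells = {(True, True): "X-high/Y-high", (True, False): "X-high/Y-low",
             (False, True): "X-low/Y-high", (False, False): "X-low/Y-low"}
    return [cells[(bool(a), bool(b))] for a, b in zip(x_hi, y_hi)]
```

A validated BP estimate would then quantify how reliably membership in the X-high cells predicts membership in the corresponding Y cells for new participants.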

Whereas simulation studies, such as the Monte Carlo experiment we have presented herein, are effective for demonstrating the performance of a statistical procedure under a wide range of conditions, they are often not sufficient to convince the cautious research community to adopt new techniques of analysis. Therefore, it is necessary to begin applying PBS to existing datasets in the behavioral and social sciences literature. We invite researchers to undertake this task independently, and we invite collaborations, to explore the degree to which significant PBS relationships have been missed in previously published and unpublished datasets. This undertaking will accomplish three important objectives: (a) It will provide a proving ground for PBS in real-data scenarios, (b) it will enable researchers to familiarize themselves with the PBS procedure and interpretive structure, and (c) it will allow researchers to identify previously overlooked bivariate relationships of theoretical importance to their research.

Author note

This research was supported by the University Research Grants Program (URGP) to Johnson Ching-Hong Li in the Department of Psychology at the University of Manitoba (#47094).