1 Introduction

Despite undeniable criticism regarding loss of information and power (Kuss 2013; Altman and Royston 2006; Royston et al. 2006), dichotomization of variables is still frequently used in the social, behavioral, and medical sciences (Altman and Royston 2006). This is mostly because dichotomization can yield more interpretable conclusions, supporting the dissemination of research outcomes (Demirtaş and Vardar Acar 2017), and can provide criteria for decision making. For example, in clinical practice, dichotomization of systolic blood pressure as under/over 140 mmHg or of birth weight as under/over 2500 g is used to establish thresholds for treatment (Peacock et al. 2012). The technique is also used to model insurance losses to support decision making in the actuarial industry (Tomarchio and Punzo 2020).

The choice of cut-point for dichotomization can be based on domain knowledge, such as biology, or on statistical considerations. In this paper, we consider a statistical approach to determining cut-points, in which the dichotomy is artificial. For example, an obese vs. non-obese categorization is based on the underlying distribution of a continuous variable, such as body mass index. A common form of dichotomization or categorization of a continuous variable is to split it at the median or the quartiles. In either case, the choice of cut-point should be properly justified.

For any transformation \(g\) of X, the R-squared from the regression of X on \(g\left( X \right)\), or simply the squared Pearson correlation \({\text{Corr}}\left( {X,{ }g\left( X \right)} \right)^{2}\), can be interpreted as the proportion of the information in X explained by \(g\left( X \right)\). Therefore, \(1 - {\text{Corr}}\left( {X,{ }g\left( X \right)} \right)^{2}\) represents the proportion of information in \(X\) lost by the transformation \(X \to g\left( X \right)\). Consequently, for the dichotomization transformation \(g\left( X \right) = X_{d}\), the information loss is \(1 - {\text{Corr}}\left( {X,{ }X_{d} } \right)^{2}\), where \({\text{Corr}}\left( {X,{ }X_{d} } \right)\) is known as the point-biserial correlation. Since dichotomization is criticized chiefly for its information loss, we use the point-biserial correlation to assess the information loss of different choices of cut-points.
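
For illustration, the information loss of a dichotomization can be estimated directly from a sample; the following R sketch uses a hypothetical normal sample and a median split:

```r
## Minimal sketch: information loss of a dichotomization, estimated from a sample
set.seed(1)
x  <- rnorm(10000)                    # hypothetical continuous variable
xd <- ifelse(x >= median(x), 1, -1)   # median split, coded 1 / -1
r_pb <- cor(x, xd)                    # point-biserial correlation Corr(X, X_d)
r_pb^2                                # information retained; about 2/pi = 0.637 here
1 - r_pb^2                            # information loss due to X -> X_d
```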

If \(X\) is normally distributed and is dichotomized at the mean (or median) as \(X_{d}\), then \({\text{Corr}}\left( {X,{ }X_{d} } \right)^{2} = 2/\pi \approx 0.8^{2}\) (Cohen 1983). For a bivariate normal pair (X, Y) with correlation r, dichotomizing one of them at its median reduces the correlation to 0.8r, or equivalently, leads to a nearly 36% (1 − 2/π) reduction in R-squared (MacCallum et al. 2002). The theory of the point-biserial correlation is well established when the underlying distribution is Gaussian. Further, Demirtaş and Hedeker (2016) investigated it for log-normal, Student's t, beta, and uniform-exponential distributions. However, its properties remain unclear for underlying Gaussian mixture distributions with different levels of bimodality.

1.1 Bimodal distribution and Gaussian mixtures

Bimodal distributions often occur when data are composed of observations from two different groups of subjects (Devore and Peck 1997, p. 43; Fiorio et al. 2010), where the groups are characterized by binary traits such as gender or disease status; examples include the heights of men and women (Schilling et al. 2002), birthweight at certain gestational-age ranges (Haglund 2007), and the distribution of speech sounds (McMurray et al. 2009). If the trait is known, then the data can easily be adjusted to restore unimodality in model building. However, if the trait is missing or latent, such as an unidentified genotype, then a dichotomization may serve as a proper surrogate for the latent trait.

In this paper, we aim to investigate the connection between point-biserial correlation and Pearson correlation when the underlying distribution is a Gaussian mixture, and to provide an optimal dichotomization algorithm in the sense of minimal information loss after identifying the bimodality of such distributions.

Given an explanatory variable X and an outcome Y, researchers have investigated optimal dichotomization from various perspectives. For a binary explanatory variable X (e.g., treatment vs. control) and a continuous outcome Y, several authors (Peacock et al. 2012; Ofuya et al. 2014; Sauzet et al. 2015) have proposed distributional approaches for dichotomizing the outcome Y by deriving the difference in proportions for treatment and control with a 95% confidence interval. For a binary outcome Y and a continuous predictor X, Nelson et al. (2017) reviewed common methods for dichotomizing the predictor X with the aim of maximizing certain statistics, such as the odds ratio, Youden’s statistic, the Gini index, the chi-square statistic, relative risk, and the kappa statistic. Chen et al. (2019) proposed two optimal cut-points for continuous predictors if their relationship with the hazard outcome is U-shaped.

In contrast with the above studies, we explore the impact of dichotomization by considering the case where both the explanatory variable X and the outcome Y are continuous. Given \(X_{d}\) as a binary variable after the dichotomization of X with underlying Gaussian mixture, we propose two optimal cut-points by maximizing the corresponding point-biserial correlations \({\text{Corr(}}X_{d} ,X{)}\) and \({\text{Corr(}}X_{d} ,Y{)}\).

Note that point-biserial correlation \(r_{b}\) is related to the two-sample t test as \({\text{t}} = r_{b} \frac{{\sqrt { n - 2} }}{{\sqrt { (1 - r_{b}^{2} )} }}\) when testing the difference in means. By regarding the t-statistic as the standardized difference between the means of two groups, maximizing \(r_{bx} = {\text{Corr(}}X_{d} ,X{)}\) with respect to \(X_{d}\) becomes equivalent to dichotomizing \(X\) into the two most divergent groups in terms of the largest standardized difference between the group means of \(X\). On the other hand, maximizing \(r_{by} = {\text{Corr(}}X_{d} ,Y{)}\) with respect to \(X_{d}\) is equivalent to separating \(Y\) into the two most divergent groups in terms of the largest standardized difference between group means of \(Y\). Therefore, if the bimodality of a variable is due to the mixture of two divergent groups, then a dichotomization \(X_{d}\) that divides the data into two groups with respect to the largest standardized difference may help the interpretability of the data, especially when other domain knowledge for dichotomization is unavailable.
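
This identity can be checked numerically; the sketch below compares the pooled-variance two-sample t statistic with the value implied by the sample point-biserial correlation (abs() only absorbs the sign convention of the group ordering):

```r
## Sketch: two-sample t statistic vs. point-biserial correlation
set.seed(1)
x  <- rnorm(100)
xd <- ifelse(x >= 0, 1, -1)                     # an arbitrary dichotomization of x
r  <- cor(x, xd)                                # sample point-biserial correlation
n  <- length(x)
r * sqrt(n - 2) / sqrt(1 - r^2)                 # t implied by the identity above
abs(t.test(x ~ xd, var.equal = TRUE)$statistic) # pooled-variance two-sample t test
```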

In simple regression of Y on X or \(X_{d}\), the squares of sample correlation \({\text{Corr(}}X,Y{)}\) and \({\text{Corr(}}X_{d} ,Y{)}\) can be interpreted as the proportion of information from Y explained by \(X\) and \(X_{d}\), respectively. Thus, we define the ratio \({\text{Corr(}}X_{d} ,Y{)}/{\text{Corr(}}X,Y{)}\) as the information retention rate (IRR) to assess the performance of \(X_{d}\).

Under the following statistical model, we derive the forms of \({\text{Corr(}}X_{d} ,X{)}\) and \({\text{Corr(}}X_{d} ,Y{)}\) in Theorems 1 and 2, respectively. Based on the results, we provide the R functions to calculate the optimal cut-points. We use simulation and real-world data examples to compare the performance of the proposed approach with other commonly used dichotomizations.

1.2 Statistical models

For a continuous outcome Y and an explanatory variable X, we assume the following Gaussian mixture on X and a piecewise linear model on X and Y.

(i) Let

$$X_{1} \sim N(\mu_{1} ,\sigma_{1}^{2} ),\quad X_{2} \sim N(\mu_{2} ,\sigma_{2}^{2} ),\quad {\text{and}}\quad X = W \cdot X_{1} + (1 - W) \cdot X_{2} ,$$
(1)

where W is a binary (0, 1) variable independent of \(X_{1}\) and \(X_{2}\) with Pr(W = 1) = γ. Then X is a Gaussian mixture of \(X_{1}\) and \(X_{2}\) with proportion γ and probability density function (pdf) \(\gamma \cdot f_{1} (X) + (1 - \gamma ) \cdot f_{2} (X)\), where the \(f_{i}\)'s are the pdfs of the \(X_{i}\)'s.

(ii) Let

$$Y_{i} = \beta_{i} X_{i} + \varepsilon_{i} ,\quad i = 1, 2,\quad {\text{where}}\ \varepsilon_{i} \sim N(0,\tau_{i}^{2} )\ {\text{is independent noise}}.$$
(2)

Then \((X_{i} ,Y_{i} )\) has the following bivariate normal distribution:

$$(X_{i} ,Y_{i} ) \sim N(\mu_{i} ,\ \mu_{yi} = \beta_{i} \mu_{i} ,\ \sigma_{i}^{2} ,\ \sigma_{yi}^{2} = \tau_{i}^{2} + \beta_{i}^{2} \sigma_{i}^{2} ,\ \rho_{i} )\ {\text{with joint pdf}}\ f_{i} (X,Y),$$

where \(\rho_{i} = \beta_{i} \sigma_{i} /\sqrt{\tau_{i}^{2} + \beta_{i}^{2} \sigma_{i}^{2}}\), and the joint pdf of \((X, Y)\) is \(\gamma \cdot f_{1} (X,Y) + (1 - \gamma ) \cdot f_{2} (X,Y)\).

(iii) Let \(X_{d} (h) = \begin{cases} 1 & {\text{if }} X \ge h \\ -1 & {\text{if }} X < h \end{cases}\) be the dichotomization of \(X\) with cut-point h.
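
As a minimal sketch with illustrative parameter values only, the following R code simulates (X, Y) from models (1)–(2) and applies the dichotomization in (iii):

```r
## Sketch: simulate (X, Y) from the Gaussian mixture model (1)-(2), then dichotomize X
simulate_mixture <- function(n, gamma, mu1, mu2, sigma1, sigma2,
                             beta1, beta2, tau1, tau2) {
  w  <- rbinom(n, 1, gamma)                      # latent indicator W, Pr(W = 1) = gamma
  x1 <- rnorm(n, mu1, sigma1)                    # X1 ~ N(mu1, sigma1^2)
  x2 <- rnorm(n, mu2, sigma2)                    # X2 ~ N(mu2, sigma2^2)
  x  <- w * x1 + (1 - w) * x2                    # Gaussian mixture, Eq. (1)
  y  <- w * (beta1 * x1 + rnorm(n, 0, tau1)) +   # piecewise linear model, Eq. (2)
        (1 - w) * (beta2 * x2 + rnorm(n, 0, tau2))
  data.frame(x = x, y = y, w = w)
}

dichotomize <- function(x, h) ifelse(x >= h, 1, -1)   # X_d(h) as in (iii)

set.seed(123)
dat <- simulate_mixture(n = 5000, gamma = 0.5, mu1 = -2, mu2 = 2,
                        sigma1 = 1, sigma2 = 1, beta1 = 1, beta2 = 1,
                        tau1 = 1, tau2 = 1)
xd  <- dichotomize(dat$x, h = 0)
c(corr_xd_x = cor(xd, dat$x), corr_xd_y = cor(xd, dat$y))
```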

2 Results

Theorem 1

Point-biserial correlation of X and its dichotomization \(X_{d} (h)\):

$${\text{Corr(}}X_{d} (h),X{)} = \frac{{(\lambda_{3} + \lambda_{4} ) + 2\gamma (1 - \gamma )(\mu_{1} - \mu_{2} )\left( {\Phi \left( {\frac{{h - \mu_{2} }}{{\sigma_{2} }}} \right) - \Phi \left( {\frac{{h - \mu_{1} }}{{\sigma_{1} }}} \right)} \right)}}{{\sqrt {1 - \left( {\lambda_{5} + \lambda_{6} } \right)^{2} } \cdot\sqrt {\lambda_{1} + \lambda_{2} } }},$$
(3)

where ϕ and Φ are the pdf and cumulative distribution function (cdf) of the standard normal distribution, and

  • \(\lambda_{1} = \gamma (1 - \gamma )(\mu_{1} - \mu_{2} )^{2} ;\lambda_{2} = \gamma \cdot \sigma_{1}^{2} + (1 - \gamma )\sigma_{2}^{2} ,\)

  • \(\lambda_{3} = \gamma \left( {2\sigma_{1} \phi \left( {\frac{{h - \mu_{1} }}{{\sigma_{1} }}} \right)} \right), \lambda_{4} = \left( {1 - \gamma } \right)\left( {2\sigma_{2} \phi \left( {\frac{{h - \mu_{2} }}{{\sigma_{2} }}} \right)} \right);\)

  • \(\lambda_{5} = \gamma \left( {1 - 2\Phi \left( {\frac{{h - \mu_{1} }}{{\sigma_{1} }}} \right)} \right), {\text{and}} \lambda_{6} = \left( {1 - \gamma } \right)\left( {1 - 2\Phi \left( {\frac{{h - \mu_{2} }}{{\sigma_{2} }}} \right)} \right).\)
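
For reference, a direct R transcription of Eq. (3) might look as follows, with dnorm() and pnorm() playing the roles of ϕ and Φ:

```r
## Sketch: Corr(X_d(h), X) from Theorem 1, transcribed from Eq. (3)
corr_xd_x <- function(h, gamma, mu1, mu2, sigma1, sigma2) {
  z1 <- (h - mu1) / sigma1
  z2 <- (h - mu2) / sigma2
  l1 <- gamma * (1 - gamma) * (mu1 - mu2)^2                 # lambda_1
  l2 <- gamma * sigma1^2 + (1 - gamma) * sigma2^2           # lambda_2
  l3 <- gamma * 2 * sigma1 * dnorm(z1)                      # lambda_3
  l4 <- (1 - gamma) * 2 * sigma2 * dnorm(z2)                # lambda_4
  l5 <- gamma * (1 - 2 * pnorm(z1))                         # lambda_5
  l6 <- (1 - gamma) * (1 - 2 * pnorm(z2))                   # lambda_6
  num <- (l3 + l4) +
         2 * gamma * (1 - gamma) * (mu1 - mu2) * (pnorm(z2) - pnorm(z1))
  num / (sqrt(1 - (l5 + l6)^2) * sqrt(l1 + l2))
}
```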

Theorem 2

Point-biserial correlation of outcome Y and the dichotomization \(X_{d} (h)\):

$${\text{Corr(}}X_{d} (h),{\text{Y)}} = \frac{{(\beta_{1} \lambda_{3} + \beta_{2} \lambda_{4} ) + (\beta_{1} \mu_{1} \lambda_{5} + \beta_{2} \mu_{2} \lambda_{6} ) - \mu_{y} (\lambda_{5} + \lambda_{6} )}}{{\sqrt {1 - (\lambda_{5} + \lambda_{6} )^{2} } \cdot\sqrt {\lambda_{1}^{\prime } + \lambda_{2}^{\prime } + { }\tau^{2} } }},$$
(4)

where \(\tau^{2} = \gamma \cdot \tau_{1}^{2} + (1 - \gamma )\tau_{2}^{2}\), \(\lambda^{\prime}_{1} = \gamma (1 - \gamma )(\beta_{1} \mu_{1} - \beta_{2} \mu_{2} )^{2}\), \(\lambda^{\prime}_{2} = \gamma \cdot \beta_{1}^{2} \sigma_{1}^{2} + (1 - \gamma )\beta_{2}^{2} \sigma_{2}^{2}\), and \(\mu_{y} = \gamma \beta_{1} \mu_{1} + (1 - \gamma )\beta_{2} \mu_{2}\) is the mean of Y.
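
Analogously, a direct R transcription of Eq. (4), using the quantities defined above:

```r
## Sketch: Corr(X_d(h), Y) from Theorem 2, transcribed from Eq. (4)
## (tau1 and tau2 are the noise standard deviations of the two components)
corr_xd_y <- function(h, gamma, mu1, mu2, sigma1, sigma2,
                      beta1, beta2, tau1, tau2) {
  z1 <- (h - mu1) / sigma1
  z2 <- (h - mu2) / sigma2
  l3 <- gamma * 2 * sigma1 * dnorm(z1)                          # lambda_3
  l4 <- (1 - gamma) * 2 * sigma2 * dnorm(z2)                    # lambda_4
  l5 <- gamma * (1 - 2 * pnorm(z1))                             # lambda_5
  l6 <- (1 - gamma) * (1 - 2 * pnorm(z2))                       # lambda_6
  tau2_mix <- gamma * tau1^2 + (1 - gamma) * tau2^2             # tau^2
  l1p <- gamma * (1 - gamma) * (beta1 * mu1 - beta2 * mu2)^2    # lambda_1'
  l2p <- gamma * beta1^2 * sigma1^2 +
         (1 - gamma) * beta2^2 * sigma2^2                       # lambda_2'
  mu_y <- gamma * beta1 * mu1 + (1 - gamma) * beta2 * mu2       # mean of Y
  num <- (beta1 * l3 + beta2 * l4) +
         (beta1 * mu1 * l5 + beta2 * mu2 * l6) - mu_y * (l5 + l6)
  num / (sqrt(1 - (l5 + l6)^2) * sqrt(l1p + l2p + tau2_mix))
}
```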

Note that in the case of a single Gaussian model, e.g., \(\mu_{1} = \mu_{2} = 0\) and \(\sigma_{1} = \sigma_{2} = 1\), Theorem 1 simplifies to the following well-known result:

$${\text{Corr(}}X_{d} (h),X{)} = \frac{\phi (h)}{{\sqrt {\Phi (h)(1 - \Phi (h))} }}.$$

Furthermore, when the split is at the mean, i.e., h = 0, this gives \(\sqrt {2/\pi } \approx 0.8\).
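
This special case can be verified numerically with the corr_xd_x() sketch above:

```r
## Check the single-Gaussian special case (mu1 = mu2 = 0, sigma1 = sigma2 = 1)
h <- 0.5
corr_xd_x(h, gamma = 0.5, mu1 = 0, mu2 = 0, sigma1 = 1, sigma2 = 1)
dnorm(h) / sqrt(pnorm(h) * (1 - pnorm(h)))   # same value
corr_xd_x(0, 0.5, 0, 0, 1, 1)                # split at the mean: sqrt(2/pi), about 0.798
```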

2.1 Diagnosing bimodality

Under a Gaussian mixture of \(N(\mu_{1} ,\sigma_{1}^{2} )\) and \(N(\mu_{2} ,\sigma_{2}^{2} )\), the rule of thumb for identifying bimodality is \(a = \frac{{ \left| {\mu_{1} - \mu_{2} } \right| }}{{\sigma_{1} + \sigma_{2} }} \ge 1\), as proposed by Cohen and Burke (1956) and Schilling et al. (2002). In practice, the \(\mu_{i}\)'s and \(\sigma_{i}^{2}\)'s are unknown before separation, and the bimodality coefficient, BC = (\({\text{Skewness}}^{2}\) + 1)/(excess Kurtosis + 3), is commonly used to detect a mixture distribution (Pfister et al. 2013). A sample with estimated BC greater than the benchmark value of 5/9 suggests a two-component mixture, where 5/9 is the expected BC of a uniform distribution. We consider this criterion too stringent for Gaussian mixtures. For example, the distribution of an equal mixture simulated from \(N( - 1.05,1)\) and \(N(1.05,1)\) is shown in Fig. 1. The histogram shows obvious twin peaks with bimodality \(a = 1.05\), yet BC = 0.41. We give the formula of BC for a Gaussian mixture of \(N(\mu_{1} ,\sigma_{1}^{2} )\) and \(N(\mu_{2} ,\sigma_{2}^{2} )\) in the “Appendix”, and suggest using the benchmark 0.4 instead of 0.556 (≈ 5/9), where 0.4 is the expected BC of the equal mixture of \(N( - 1, 1)\) and \(N(1, 1)\), with corresponding bimodality \(a = 1\).
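
A sample version of BC can be computed with simple moment estimators, as in the following sketch (the small-sample correction used by some implementations is omitted):

```r
## Sketch: sample bimodality coefficient BC = (skewness^2 + 1) / (excess kurtosis + 3)
bimodality_coef <- function(x) {
  z    <- (x - mean(x)) / sd(x)
  skew <- mean(z^3)                  # sample skewness
  exk  <- mean(z^4) - 3              # sample excess kurtosis
  (skew^2 + 1) / (exk + 3)
}

set.seed(1)
x <- c(rnorm(5e4, -1.05, 1), rnorm(5e4, 1.05, 1))  # equal mixture as in Fig. 1
bimodality_coef(x)                                 # about 0.41, below the 5/9 benchmark
```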

Fig. 1 Equal mixture simulated from \(N\left( { - 1.05,1} \right)\) and \(N\left( {1.05,1} \right)\) shows obvious twin peaks with bimodality \(a = 1.05\), but BC = 0.41

For illustration, we take the split at the inverse-standard-deviation-weighted (ISDW) mean as an example, i.e., \(h = \frac{{\frac{{\mu_{1} }}{{\sigma_{1} }} + \frac{{\mu_{2} }}{{\sigma_{2} }}}}{{\sigma_{1}^{ - 1} + \sigma_{2}^{ - 1} }} = \frac{{\left( {\sigma_{2} \mu_{1} + \sigma_{1} \mu_{2} } \right)}}{{\sigma_{1} + \sigma_{2} }}\), and show the relationships between \({\text{Corr(}}X_{d} (h),X{)}\), the bimodality index \(a = \frac{{\left| {\mu_{1} - \mu_{2} } \right|}}{{\sigma_{1} + \sigma_{2} }}\), and the mixture proportion γ in Fig. 2 (left). As bimodality increases, the point-biserial correlation increases. Furthermore, all the bimodal cases (a ≥ 1) have higher correlations than the horizontal line for the unimodal case (a = 0) at 0.798, except for highly unbalanced mixtures with γ < 0.2 or γ > 0.8. Figure 2 (right) shows the relationships between \({\text{Corr(}}X_{d} (h),X{)}\), the split point h, and γ under equal variances (\(\sigma_{1} = \sigma_{2} = \sigma\)), equal slopes (\(\beta_{1} = \beta_{2}\)), \(\mu_{1} = - \mu_{2}\), and \(a = 2\). We see that \({\text{Corr(}}X_{d} (h),X{)}\) peaks at h = 0 under various γ values, but for a highly unbalanced mixture with γ = 0.1, \({\text{Corr(}}X_{d} (h),X{)}\) can fall below the 0.8 benchmark.
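
Both the ISDW-mean cut and the cut maximizing \({\text{Corr(}}X_{d} (h),X{)}\) are easy to compute from the mixture parameters, for example with optimize() and the corr_xd_x() sketch above (illustrative values):

```r
## Sketch: ISDW-mean cut vs. the cut that maximizes Corr(X_d(h), X)
gamma <- 0.5; mu1 <- -2; mu2 <- 2; sigma1 <- 1; sigma2 <- 1   # bimodality a = 2
h_isdw <- (sigma2 * mu1 + sigma1 * mu2) / (sigma1 + sigma2)   # ISDW mean

h_opt <- optimize(function(h) corr_xd_x(h, gamma, mu1, mu2, sigma1, sigma2),
                  interval = c(mu1 - 3 * sigma1, mu2 + 3 * sigma2),
                  maximum = TRUE)$maximum

c(ISDW = h_isdw, Optimal_x = h_opt)   # both near 0 in this symmetric example
```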

Fig. 2 Point-biserial correlation of (\(X_{d} ,X\)) versus the mixture proportion γ and the bimodality index \(a = \frac{{\left| {\mu_{1} - \mu_{2} } \right|}}{{\sigma_{1} + \sigma_{2} }}\) when split at the inverse-standard-deviation-weighted (ISDW) mean (left), and \({\text{Corr}}\left( {X_{d} \left( h \right),X} \right)\) versus the split point h and γ (right)

Figure 3 shows an example of non-parallel and heterogeneous mixtures, with slopes 0 and 2 and variances 2 and 1 for groups 1 and 2, respectively, and proportion γ = 0.5. The corresponding IRR = 1.07 > 1 indicates that, for a non-parallel Gaussian mixture, a proper dichotomization may yield an even higher point-biserial correlation with Y than the Pearson correlation of the continuous variables.

Fig. 3 Illustration of \({\text{Corr}}\left( {X_{d} \left( h \right),X} \right)\), split point h, and γ, when \(a = 2\) (assuming \(\sigma_{1} = \sigma_{2} = \sigma\) and \(\mu_{1} = - \mu_{2} = 2\sigma\)). The horizontal line near 0.8 is used as the benchmark

If \(X_{1}\) and \(X_{2}\) have parallel effects on Y, i.e., \(\beta_{1} = \beta_{2} = \beta\), then we have the following corollary.

Corollary 1

If \(\beta_{1} = \beta_{2} = \beta\):

$${\text{Corr(}}X_{d} (h),{\text{Y)}} = {\text{Corr(}}X_{d} (h),{\text{X)}} \cdot \frac{{\beta \sqrt {\lambda_{1} + \lambda_{2} } }}{{\sqrt {\beta^{2} (\lambda_{1} + \lambda_{2} ) + \tau^{2} } }}\quad {\text{and}}\quad {\text{IRR}}(h) = \frac{{{\text{Corr(}}X_{d} (h),{\text{Y)}}}}{{{\text{Corr}}(X,{\text{Y}})}} = {\text{Corr(}}X_{d} (h),{\text{X)}}.$$

Since \(\lambda_{1}\) and \(\lambda_{2}\) do not depend on the choice of h in this scenario, i.e., \(Y = \beta X + \varepsilon\), the optimal h that maximizes \({\text{Corr(}}X_{d} (h),X{)}\) also maximizes \({\text{Corr(}}X_{d} (h),{\text{Y)}}\) and the IRR. This result coincides with that of Demirtaş and Hedeker (2016).

2.2 Comparing various splitting approaches by simulation

We used a simulation study to compare the two optimal splits proposed in this paper with the following three commonly used cut-points. The median was calculated directly from the sample of X. The estimated proportion in the “\(\hat{\gamma }\) percentile” and the estimates of (\(\mu_{1} ,\mu_{2} ,\sigma_{1} ,\sigma_{2}\)) in the ISDW mean were obtained by applying mclust(), a Gaussian mixture model classifier in R, to X. Our proposed splits h for Optimal_x and Optimal_xy were obtained by maximizing \({\text{Corr(}}X_{d} (h),X{)}\) and \({\text{Corr(}}X_{d} (h),Y{)}\), respectively. We assessed performance based on \({\text{Corr(}}X_{d} (h),{{X)}}\), \({\text{Corr(}}X_{d} (h),{{Y)}}\), and IRR.
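
The released dicho_cor_x() and dicho_cor_xy() functions are available in the linked repository; purely as an illustration, the five cut-points could be obtained along the following lines, reusing the simulated dat and the corr_xd_x() sketch above. The mixture fit assumes the Mclust() interface of the mclust package, and Optimal_xy is shown in a simple empirical form that maximizes the sample correlation rather than the plug-in formula of Theorem 2:

```r
## Sketch of the five cut-points compared in the simulation (illustrative only)
library(mclust)

fit <- Mclust(dat$x, G = 2, modelNames = "V")    # two-component Gaussian mixture fit
p   <- fit$parameters$pro[1]                     # estimated gamma
mu  <- fit$parameters$mean                       # estimated mu1, mu2
sg  <- sqrt(fit$parameters$variance$sigmasq)     # estimated sigma1, sigma2
rng <- quantile(dat$x, c(0.05, 0.95))            # search range for the optimal cuts

h_median <- median(dat$x)                                          # 1: median
h_gamma  <- quantile(dat$x, probs = p)                             # 2: gamma-hat percentile
h_isdw   <- (sg[2] * mu[1] + sg[1] * mu[2]) / (sg[1] + sg[2])      # 3: ISDW mean
h_opt_x  <- optimize(function(h) corr_xd_x(h, p, mu[1], mu[2], sg[1], sg[2]),
                     interval = rng, maximum = TRUE)$maximum       # 4: Optimal_x
h_opt_xy <- optimize(function(h) cor(ifelse(dat$x >= h, 1, -1), dat$y),
                     interval = rng, maximum = TRUE)$maximum       # 5: Optimal_xy (empirical)
```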

The first part of the simulation was based on the Gaussian mixture settings (1) and (2) with eight parameter settings of (\(\beta_{1} ,\beta_{2} ,\mu_{1} ,\mu_{2} ,\sigma_{1} ,\sigma_{2}\)) and \(\tau^{2} = 1\). The first four settings assume a parallel x–y relation between the two mixture components, with \((\beta_{1} ,\beta_{2} ) = (1,1)\), while the last four assume a non-parallel relation, with \((\beta_{1} ,\beta_{2} ) = (0,2)\), i.e., one component has dose–response slope \(\beta_{1} = 0\) and the other has slope \(\beta_{2} = 2\). Each setting was replicated 200 times. The bimodality index and BC of each Gaussian mixture setting, and the point-biserial correlation coefficients for the five different dichotomizations, are shown in Supplementary Table S1.

2.3 Robustness against non-normality

To test the methods’ robustness against non-normality, we replaced \(N(\mu_{1} ,\sigma_{1}^{2} )\) and \(N(\mu_{2} ,\sigma_{2}^{2} )\) in the Gaussian mixture (1) with a mixture of two shifted exponential random variables, \(X_{1} + (\mu_{1} - \sigma_{1}\)) and \(X_{2} + (\mu_{2} - \sigma_{2}\)), where \(X_{i} \sim Exp(\sigma_{i} )\) with mean \(\sigma_{i}\). Under this setting, the means and variances of the two components, \((\mu_{1} ,\sigma_{1}^{2} )\) and \((\mu_{2} ,\sigma_{2}^{2} )\), as well as the bimodality index \(\frac{{\left| {\mu_{1} - \mu_{2} } \right|}}{{\sigma_{1} + \sigma_{2} }}\), are the same as those of the Gaussian mixtures in Fig. 4. The simulation results are shown in Fig. 5, where each parameter setting is comparable with its counterpart in Fig. 4.
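
A sketch of this robustness setting, interpreting Exp(\(\sigma_i\)) as an exponential with mean \(\sigma_i\) (rate \(1/\sigma_i\)) so that the component means and variances match the Gaussian case:

```r
## Sketch: shifted-exponential mixture used in the robustness check
simulate_exp_mixture <- function(n, gamma, mu1, mu2, sigma1, sigma2) {
  w  <- rbinom(n, 1, gamma)
  x1 <- rexp(n, rate = 1 / sigma1) + (mu1 - sigma1)   # mean mu1, variance sigma1^2
  x2 <- rexp(n, rate = 1 / sigma2) + (mu2 - sigma2)   # mean mu2, variance sigma2^2
  w * x1 + (1 - w) * x2
}

set.seed(7)
x <- simulate_exp_mixture(1e5, gamma = 0.5, mu1 = -2, mu2 = 2, sigma1 = 1, sigma2 = 1)
c(mean(x), var(x))   # close to 0 and 5, the moments of the matching Gaussian mixture
```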

Fig. 4 Comparison of point-biserial correlation Corr(\(X_{d} ,X)\) and IRR (Corr(\(X_{d} ,Y)/{\text{Corr}}\left( {X,Y} \right)\)) for 5 dichotomy methods (1: Median, 2: Percentile, 3: ISDW*, 4: Optimal_x and 5: Optimal_xy) by different bimodalities under parallel and non-parallel Gaussian mixture settings. *ISDW mean: inverse-standard-deviation-weighted mean, \(h = \frac{{\left( {\sigma_{2} \mu_{1} + \sigma_{1} \mu_{2} } \right)}}{{\sigma_{1} + \sigma_{2} }}\)

Fig. 5 Comparison of point-biserial correlation Corr(\(X_{d} ,X)\) and IRR for 5 dichotomy methods (1: Median, 2: Percentile, 3: ISDW, 4: Optimal_x and 5: Optimal_xy) by different bimodalities under parallel and non-parallel exponential mixture settings

2.4 Results of simulation

Figure 4 shows the results from eight parameter settings of (\(\beta_{1} ,\beta_{2} ,\mu_{1} ,\mu_{2} ,\sigma_{1} ,\sigma_{2}\)), with three resulting bimodality indices, 1, 4/3, and 2, under the Gaussian mixture. The split \(X_{d}\) of Optimal_x has the highest correlation with X and therefore preserves the most original information. However, since its selection does not depend on Y, it may not have the highest correlation with Y. By contrast, \(X_{d}\) of Optimal_xy has the highest correlation with Y; however, the selection of this split is based on the assumption of a random sampling scheme on (X, Y), and the split may vary with the sampling scheme of Y, for example, under stratified sampling. Scenarios with larger bimodality achieved higher IRR after dichotomization.

For all the parallel settings in Fig. 4, IRR < 1, and higher bimodality resulted in higher correlation and IRR. However, in some non-parallel settings, IRR > 1, and the two optimal splits performed much better than the other splits, particularly when compared with the results in the parallel settings. This may indicate that, if the explanatory variable X is a mixture of two normally distributed groups that each have a different effect on the outcome Y, then a proper dichotomization of X may be suitable, in the sense that little information in the continuous X is lost and the point-biserial correlation may even exceed the original Pearson correlation.

Figure 5 shows the results for similar parameter settings, except using a mixture of exponential distributions. Compared with their counterparts in Fig. 4, most of the IRRs drop. More details can be found in Supplementary Tables S1–S2, where, on average, the IRR for dichotomization by the \(\hat{\gamma }\) percentile and the ISDW mean dropped 18.3% and 13.7%, respectively, while that by Optimal_x and Optimal_xy dropped only 4.9% and 6.0%, respectively. This suggests that, although deviation from the Gaussian mixture assumption may reduce information retention, the two proposed dichotomizations lose the least.

2.5 Example with real data sets

We used four real-world data sets to illustrate the performance of various dichotomization strategies, as shown in Table 1 and Fig. 6. Optimal_x and Optimal_xy are the cut-points obtained from dicho_cor_x() and dicho_cor_xy() by maximizing \({\text{Corr(}}X_{d} (h),X{)}\) and \({\text{Corr(}}X_{d} (h),Y{)}\) in Theorems 1 and 2, respectively.

Table 1 Results after various dichotomization strategies for three real-world data sets
Fig. 6 Distribution of X and the corresponding optimal cut-point for the four data sets

2.5.1 Data 1

Clinical trials have demonstrated an association between HbA1c and type 2 diabetes; thus, HbA1c is an important reference for diabetes diagnosis. HbA1c is known to reflect blood glucose levels over the past 2–3 months (Zhou et al. 2013). In the data from Lian et al. (2023), 268 ischemic stroke patients with measured HbA1c and before-meal blood glucose (BBG) were recruited, and 83 of them were diagnosed with diabetes. Using the procedure dicho_cor_x(), a cut-point at a plasma glucose concentration of 140 mg/dL was selected, which happens to coincide with the diabetes signal proposed by the American Diabetes Association (2009). This suggested that the 140 mg/dL cut-point is suitable not only for the general population but also for this specific group of ischemic stroke patients.

The Pearson correlation coefficient between HbA1c and glucose was 0.69, and the point-biserial correlation coefficient between HbA1c and the dichotomized glucose indicator (dGlucose, dichotomized at Glucose = 140) was 0.67, indicating that about 95% of the information was retained after dichotomization. Table 2 shows the performance of diabetes status prediction by logistic regression using four models: Glucose (continuous) only, dGlucose (dichotomized at Glucose = 140) only, Glucose + HbA1c, and dGlucose + HbA1c. The resulting sensitivity and specificity for dGlucose are 72.3% and 64.9%, respectively, with a Youden index (sensitivity + specificity − 1) similar to the maximum Youden index of continuous Glucose. With the additional predictor HbA1c, dGlucose also achieved a Youden index similar to that of the continuous model.
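
A sketch of this comparison, assuming a hypothetical data frame stroke with columns DM (0/1 diabetes status), Glucose, and HbA1c; for simplicity, the Youden index is evaluated at a fixed 0.5 probability cut-off rather than maximized over cut-offs:

```r
## Sketch: logistic models with continuous vs. dichotomized glucose (hypothetical data)
stroke$dGlucose <- as.numeric(stroke$Glucose >= 140)     # dichotomized at 140 mg/dL

m_cont <- glm(DM ~ Glucose  + HbA1c, family = binomial, data = stroke)
m_dich <- glm(DM ~ dGlucose + HbA1c, family = binomial, data = stroke)

youden <- function(model, outcome, cut = 0.5) {
  pred <- as.numeric(fitted(model) >= cut)               # predicted diabetes status
  sens <- mean(pred[outcome == 1] == 1)                  # sensitivity
  spec <- mean(pred[outcome == 0] == 0)                  # specificity
  sens + spec - 1                                        # Youden index
}
c(continuous = youden(m_cont, stroke$DM), dichotomized = youden(m_dich, stroke$DM))
```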

Table 2 Classification results of diabetes (DM) by logistic regression using four models: Glucose (continuous) only, dGlucose (dichotomized at Glucose = 140) only, Glucose + HbA1c, and dGlucose + HbA1c

2.5.2 Data 2

Earthquake data were collected by the Seismology Center of Taiwan during the first 48 h after an earthquake of local magnitude (ML) 7.3 struck Taiwan on September 21, 1999. The data consist of 47 records of magnitude (X), duration (Y), depth, location, etc., for the main shock and early aftershocks. Figure 6 (Data 2) shows that X is bimodal with bimodality 2.26 > 1. The median split \(x_{d}\) explains only 47.3% (the square of 0.688) of the variation in \(x\) and retains only 56.1% of the correlation with duration. The other splits preserve more information from the continuous variable (70–72%) and achieve higher IRR (0.86–0.875). A link to the dataset is provided with the software at the end of this paper.

2.5.3 Data 3

Height (X) and weight (Y) data of 127,682 individuals (82,035 women and 45,647 men) were collected by the Taiwan Biobank. Figure 6 (Data 3) shows that the bimodality of height is not obvious, with a = 0.865; therefore, in this case, there is less difference between the splits than in Data 1. Nonetheless, the two optimal splits preserve more information from the continuous height and have higher IRR with weight than the other dichotomizations.

2.5.4 Data 4

Insurance data contributed by Bob Wakefield were downloaded from data.world. We analyzed the claim amounts (X) of 1,338 insured people. The histogram of X in Fig. 6 (Data 4) is skewed to the right, similar to Data 1, with moderate bimodality a = 1.22. As in the previous examples, Optimal_x preserves more information of X (0.87² = 0.757) than splits such as the median (0.689² = 0.475) and the ISDW mean (0.79² = 0.624).

In summary, the studies based on simulations and real-world data indicate that our proposed approaches preserve the most information of X and usually result in a higher retention rate for the correlation with the outcome Y. All four data sets are available via the link provided with the software (see the user's manual below).

2.6 Software

The R code for the functions dicho_cor_x() and dicho_cor_xy(), which calculate the cut-point h that yields the maximum \({\text{Corr(}}X_{d} (h),{\text{X)}}\) and \({\text{Corr(}}X_{d} (h),{\text{Y)}}\), respectively, can be downloaded from the following GitHub link, and the flowcharts of the algorithms are shown in Fig. 7. The functions rely on mclust(), optimize(), and piecewise(), which the user needs to install first; the code is available at: https://github.com/iblian/dicho_cor

Fig. 7 The flowcharts of the algorithms

3 Conclusion

When dichotomizing data with a bimodal distribution, this paper assumed that an artificial division into the two groups with the largest standardized difference in means may be more interpretable when other domain knowledge for dichotomization is lacking. We thus proposed a dichotomization procedure that maximizes the standardized mean difference between groups while also minimizing the information loss, under the assumption of a bimodal Gaussian mixture. Accordingly, the optima may not hold if the assumption is violated, if the data are not from a random sample, or when other domain knowledge is available.

When deviating from the Gaussian mixture assumption (to an exponential mixture in our simulation), the IRRs for the cut-points from Optimal_x and Optimal_xy dropped only 4.9% and 6.0%, respectively, while the IRRs for the other cut-points (\(\hat{\gamma }\) percentile and ISDW mean) dropped, on average, 18.3% and 13.7%, respectively. Our simulation also shows that, even with a non-Gaussian mixture (two-component exponential in our trials), the \({\text{Corr(}}X_{d} ,X{)}\) obtained from Optimal_x is > 0.8 when BC > 0.46 and > 0.9 when BC > 0.5. Therefore, we propose the following dichotomization procedure for \(X\):

Step 1: Calculate the bimodality coefficient (BC) of \(X\). If BC > 0.4, then proceed.

Step 2: Apply dicho_cor_x() and dicho_cor_xy() to calculate the cut-points h that yield the maximum \({\text{Corr(}}X_{d} (h),{\text{X)}}\) and \({\text{Corr(}}X_{d} (h),{\text{Y)}}\), respectively.
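
Putting the two steps together, a possible wrapper is sketched below; the calls to dicho_cor_x(x) and dicho_cor_xy(x, y) are hypothetical, and the actual interfaces are documented in the repository:

```r
## Sketch of the two-step procedure; dicho_cor_x()/dicho_cor_xy() signatures are assumed
dichotomize_if_bimodal <- function(x, y, bc_threshold = 0.4) {
  bc <- bimodality_coef(x)                  # Step 1: bimodality coefficient of X
  if (bc <= bc_threshold) {
    return(list(bc = bc, cut = NULL))       # no evidence of a two-component mixture
  }
  list(bc   = bc,
       h_x  = dicho_cor_x(x),               # Step 2: cut maximizing Corr(X_d(h), X)
       h_xy = dicho_cor_xy(x, y))           #         cut maximizing Corr(X_d(h), Y)
}
```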

This paper established the connection between the Pearson correlation and the point-biserial correlation for underlying Gaussian mixture distributions with different levels of bimodality, and proposed a dichotomization method that maximizes the point-biserial correlation, which is equivalent to minimizing the information loss. Although minimizing information loss is not the main objective of dichotomization, it is a desirable property for an interpretable dichotomization.