1 Introduction

Despite undeniable criticism regarding loss of information and power (Kuss 2013; Altman and Royston 2006; Royston et al. 2006), dichotomization of variables is still frequently used in the social, behavioral, and medical sciences (Altman and Royston 2006). This is mostly because dichotomization can yield more interpretable conclusions, supporting the dissemination of research outcomes (Demirtaş and Vardar Acar 2017), and can provide criteria for decision making. For example, in clinical practice, dichotomization of systolic blood pressure as under/over 140 mmHg or of birth weight as under/over 2500 g is used to establish thresholds for treatment (Peacock et al. 2012). The technique is also used to model insurance losses to support decision making in the actuarial industry (Tomarchio and Punzo 2020).

The choice of cut-point for dichotomization can be based on domain knowledge, such as biology, or on statistical considerations. In this paper, we consider a statistical approach to determining cut-points, in which the dichotomy is artificial. For example, an obese vs. non-obese categorization is based on the underlying distribution of a continuous variable, such as body mass index. A common form of dichotomization or categorization of a continuous variable is to split it at the median or the quartiles. In either case, the choice of cut-point should be properly justified.

For any transformation \(g\) of X, the R-squared from the regression of X on \(g\left( X \right)\), or simply the squared Pearson correlation \({\text{Corr}}\left( {X,{ }g\left( X \right)} \right)^{2}\), can be interpreted as the proportion of the information in X explained by \(g\left( X \right)\). Therefore, \(1 - {\text{Corr}}\left( {X,{ }g\left( X \right)} \right)^{2}\) represents the proportion of information in \(X\) lost by the transformation \(X \to g\left( X \right)\). Consequently, for the dichotomization transformation \(g\left( X \right) = X_{d}\), the information loss is \(1 - {\text{Corr}}\left( {X,{ }X_{d} } \right)^{2}\), where \({\text{Corr}}\left( {X,{ }X_{d} } \right)\) is known as the point-biserial correlation. Since dichotomization is criticized chiefly for its information loss, we use the point-biserial correlation to assess the information loss of different choices of cut-points.
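
For illustration, the information loss of a dichotomization can be estimated directly from a sample; the following R sketch uses a hypothetical normal sample and a median split:

```r
## Minimal sketch: information loss of a dichotomization, estimated from a sample
set.seed(1)
x  <- rnorm(10000)                    # hypothetical continuous variable
xd <- ifelse(x >= median(x), 1, -1)   # median split, coded 1 / -1
r_pb <- cor(x, xd)                    # point-biserial correlation Corr(X, X_d)
r_pb^2                                # information retained; about 2/pi = 0.637 here
1 - r_pb^2                            # information loss due to X -> X_d
```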

If \(X\) is normally distributed and is dichotomized at the mean (or median) as \(X_{d}\), then \({\text{Corr}}\left( {X,{ }X_{d} } \right)^{2} = 2/\pi \approx 0.8^{2}\) (Cohen 1983). For a bivariate normal pair (X, Y) with correlation r, dichotomizing one of them at its median reduces the correlation to 0.8r, or equivalently, leads to a nearly 36% (1 − 2/π) reduction in R-squared (MacCallum et al. 2002). The theory of the point-biserial correlation is well established when the underlying distribution is Gaussian. Further, Demirtaş and Hedeker (2016) investigated it for log-normal, Student's t, beta, and uniform-exponential distributions. However, its properties remain unclear for underlying Gaussian mixture distributions with different levels of bimodality.

1.1 Bimodal distribution and Gaussian mixtures

Bimodal distributions often occur when data are composed of observations from two different groups of subjects (Devore and Peck 1997, p. 43; Fiorio et al. 2010), where the groups are characterized by binary traits such as gender or disease status; examples include the heights of men and women (Schilling et al. 2002), birthweight at certain gestational-age ranges (Haglund 2007), and the distribution of speech sounds (McMurray et al. 2009). If the trait is known, then the data can easily be adjusted to restore unimodality in model building. However, if the trait is missing or latent, such as an unidentified genotype, then a dichotomization may serve as a proper surrogate for the latent trait.

In this paper, we aim to investigate the connection between point-biserial correlation and Pearson correlation when the underlying distribution is a Gaussian mixture, and to provide an optimal dichotomization algorithm in the sense of minimal information loss after identifying the bimodality of such distributions.

Given an explanatory variable X and an outcome Y, researchers have investigated optimal dichotomization from various perspectives. For a binary explanatory variable X (e.g., treatment vs. control) and a continuous outcome Y, several authors (Peacock et al. 2012; Ofuya et al. 2014; Sauzet et al. 2015) have proposed distributional approaches for dichotomizing the outcome Y by deriving the difference in proportions for treatment and control with a 95% confidence interval. For a binary outcome Y and a continuous predictor X, Nelson et al. (2017) reviewed common methods for dichotomizing the predictor X with the aim of maximizing certain statistics, such as the odds ratio, Youden’s statistic, the Gini index, the chi-square statistic, relative risk, and the kappa statistic. Chen et al. (2019) proposed two optimal cut-points for continuous predictors if their relationship with the hazard outcome is U-shaped.

In contrast with the above studies, we explore the impact of dichotomization by considering the case where both the explanatory variable X and the outcome Y are continuous. Given \(X_{d}\) as a binary variable after the dichotomization of X with underlying Gaussian mixture, we propose two optimal cut-points by maximizing the corresponding point-biserial correlations \({\text{Corr(}}X_{d} ,X{)}\) and \({\text{Corr(}}X_{d} ,Y{)}\).

Note that point-biserial correlation \(r_{b}\) is related to the two-sample t test as \({\text{t}} = r_{b} \frac{{\sqrt { n - 2} }}{{\sqrt { (1 - r_{b}^{2} )} }}\) when testing the difference in means. By regarding the t-statistic as the standardized difference between the means of two groups, maximizing \(r_{bx} = {\text{Corr(}}X_{d} ,X{)}\) with respect to \(X_{d}\) becomes equivalent to dichotomizing \(X\) into the two most divergent groups in terms of the largest standardized difference between the group means of \(X\). On the other hand, maximizing \(r_{by} = {\text{Corr(}}X_{d} ,Y{)}\) with respect to \(X_{d}\) is equivalent to separating \(Y\) into the two most divergent groups in terms of the largest standardized difference between group means of \(Y\). Therefore, if the bimodality of a variable is due to the mixture of two divergent groups, then a dichotomization \(X_{d}\) that divides the data into two groups with respect to the largest standardized difference may help the interpretability of the data, especially when other domain knowledge for dichotomization is unavailable.
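
This identity can be checked numerically; the sketch below compares the pooled-variance two-sample t statistic with the value implied by the sample point-biserial correlation (abs() only absorbs the sign convention of the group ordering):

```r
## Sketch: two-sample t statistic vs. point-biserial correlation
set.seed(1)
x  <- rnorm(100)
xd <- ifelse(x >= 0, 1, -1)                     # an arbitrary dichotomization of x
r  <- cor(x, xd)                                # sample point-biserial correlation
n  <- length(x)
r * sqrt(n - 2) / sqrt(1 - r^2)                 # t implied by the identity above
abs(t.test(x ~ xd, var.equal = TRUE)$statistic) # pooled-variance two-sample t test
```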

In simple regression of Y on X or \(X_{d}\), the squares of sample correlation \({\text{Corr(}}X,Y{)}\) and \({\text{Corr(}}X_{d} ,Y{)}\) can be interpreted as the proportion of information from Y explained by \(X\) and \(X_{d}\), respectively. Thus, we define the ratio \({\text{Corr(}}X_{d} ,Y{)}/{\text{Corr(}}X,Y{)}\) as the information retention rate (IRR) to assess the performance of \(X_{d}\).

Under the following statistical model, we derive the forms of \({\text{Corr(}}X_{d} ,X{)}\) and \({\text{Corr(}}X_{d} ,Y{)}\) in Theorems 1 and 2, respectively. Based on the results, we provide the R functions to calculate the optimal cut-points. We use simulation and real-world data examples to compare the performance of the proposed approach with other commonly used dichotomizations.

1.2 Statistical models

For a continuous outcome Y and an explanatory variable X, we assume the following Gaussian mixture on X and a piecewise linear model on X and Y.

(i) Let

$$X_{1} \sim N(\mu_{1} ,\sigma_{1}^{2} ),\quad X_{2} \sim N(\mu_{2} ,\sigma_{2}^{2} ),\quad {\text{and}}\quad X = W \cdot X_{1} + (1 - W) \cdot X_{2} ,$$
(1)

where W is a binary (0, 1) variable independent of \(X_{1}\) and \(X_{2}\) with Pr(W = 1) = γ. Then X is a Gaussian mixture of \(X_{1}\) and \(X_{2}\) with proportion γ and probability density function (pdf) \(\gamma \cdot f_{1} (X) + (1 - \gamma ) \cdot f_{2} (X)\), where the \(f_{i}\)'s are the pdfs of the \(X_{i}\)'s.

(ii) Let

$$Y_{i} = \beta_{i} X_{i} + \varepsilon_{i} ,\quad i = 1, 2,\quad {\text{where}}\ \varepsilon_{i} \sim N(0,\tau_{i}^{2} )\ {\text{is independent noise}}.$$
(2)

Then \((X_{i} ,Y_{i} )\) has the following bivariate normal distribution:

$$(X_{i} ,Y_{i} ) \sim N(\mu_{i} ,\ \mu_{yi} = \beta_{i} \mu_{i} ,\ \sigma_{i}^{2} ,\ \sigma_{yi}^{2} = \tau_{i}^{2} + \beta_{i}^{2} \sigma_{i}^{2} ,\ \rho_{i} )\ {\text{with joint pdf}}\ f_{i} (X,Y),$$

where \(\rho_{i} = \beta_{i} \sigma_{i} /\sqrt{\tau_{i}^{2} + \beta_{i}^{2} \sigma_{i}^{2}}\), and the joint pdf of \((X, Y)\) is \(\gamma \cdot f_{1} (X,Y) + (1 - \gamma ) \cdot f_{2} (X,Y)\).

(iii) Let \(X_{d} (h) = \begin{cases} 1 & {\text{if }} X \ge h \\ -1 & {\text{if }} X < h \end{cases}\) be the dichotomization of \(X\) with cut-point h.
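
As a minimal sketch with illustrative parameter values only, the following R code simulates (X, Y) from models (1)–(2) and applies the dichotomization in (iii):

```r
## Sketch: simulate (X, Y) from the Gaussian mixture model (1)-(2), then dichotomize X
simulate_mixture <- function(n, gamma, mu1, mu2, sigma1, sigma2,
                             beta1, beta2, tau1, tau2) {
  w  <- rbinom(n, 1, gamma)                      # latent indicator W, Pr(W = 1) = gamma
  x1 <- rnorm(n, mu1, sigma1)                    # X1 ~ N(mu1, sigma1^2)
  x2 <- rnorm(n, mu2, sigma2)                    # X2 ~ N(mu2, sigma2^2)
  x  <- w * x1 + (1 - w) * x2                    # Gaussian mixture, Eq. (1)
  y  <- w * (beta1 * x1 + rnorm(n, 0, tau1)) +   # piecewise linear model, Eq. (2)
        (1 - w) * (beta2 * x2 + rnorm(n, 0, tau2))
  data.frame(x = x, y = y, w = w)
}

dichotomize <- function(x, h) ifelse(x >= h, 1, -1)   # X_d(h) as in (iii)

set.seed(123)
dat <- simulate_mixture(n = 5000, gamma = 0.5, mu1 = -2, mu2 = 2,
                        sigma1 = 1, sigma2 = 1, beta1 = 1, beta2 = 1,
                        tau1 = 1, tau2 = 1)
xd  <- dichotomize(dat$x, h = 0)
c(corr_xd_x = cor(xd, dat$x), corr_xd_y = cor(xd, dat$y))
```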

2 Results

Theorem 1

Point-biserial correlation of X and its dichotomization \(X_{d} (h)\):

$${\text{Corr(}}X_{d} (h),X{)} = \frac{{(\lambda_{3} + \lambda_{4} ) + 2\gamma (1 - \gamma )(\mu_{1} - \mu_{2} )\left( {\Phi \left( {\frac{{h - \mu_{2} }}{{\sigma_{2} }}} \right) - \Phi \left( {\frac{{h - \mu_{1} }}{{\sigma_{1} }}} \right)} \right)}}{{\sqrt {1 - \left( {\lambda_{5} + \lambda_{6} } \right)^{2} } \cdot\sqrt {\lambda_{1} + \lambda_{2} } }},$$
(3)

where ϕ and Φ are the pdf and cumulative distribution function (cdf) of the standard normal distribution, and

  • \(\lambda_{1} = \gamma (1 - \gamma )(\mu_{1} - \mu_{2} )^{2} ;\lambda_{2} = \gamma \cdot \sigma_{1}^{2} + (1 - \gamma )\sigma_{2}^{2} ,\)

  • \(\lambda_{3} = \gamma \left( {2\sigma_{1} \phi \left( {\frac{{h - \mu_{1} }}{{\sigma_{1} }}} \right)} \right), \lambda_{4} = \left( {1 - \gamma } \right)\left( {2\sigma_{2} \phi \left( {\frac{{h - \mu_{2} }}{{\sigma_{2} }}} \right)} \right);\)

  • \(\lambda_{5} = \gamma \left( {1 - 2\Phi \left( {\frac{{h - \mu_{1} }}{{\sigma_{1} }}} \right)} \right), {\text{and}} \lambda_{6} = \left( {1 - \gamma } \right)\left( {1 - 2\Phi \left( {\frac{{h - \mu_{2} }}{{\sigma_{2} }}} \right)} \right).\)
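
For reference, a direct R transcription of Eq. (3) might look as follows, with dnorm() and pnorm() playing the roles of ϕ and Φ:

```r
## Sketch: Corr(X_d(h), X) from Theorem 1, transcribed from Eq. (3)
corr_xd_x <- function(h, gamma, mu1, mu2, sigma1, sigma2) {
  z1 <- (h - mu1) / sigma1
  z2 <- (h - mu2) / sigma2
  l1 <- gamma * (1 - gamma) * (mu1 - mu2)^2                 # lambda_1
  l2 <- gamma * sigma1^2 + (1 - gamma) * sigma2^2           # lambda_2
  l3 <- gamma * 2 * sigma1 * dnorm(z1)                      # lambda_3
  l4 <- (1 - gamma) * 2 * sigma2 * dnorm(z2)                # lambda_4
  l5 <- gamma * (1 - 2 * pnorm(z1))                         # lambda_5
  l6 <- (1 - gamma) * (1 - 2 * pnorm(z2))                   # lambda_6
  num <- (l3 + l4) +
         2 * gamma * (1 - gamma) * (mu1 - mu2) * (pnorm(z2) - pnorm(z1))
  num / (sqrt(1 - (l5 + l6)^2) * sqrt(l1 + l2))
}
```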

Theorem 2

Point-biserial correlation of outcome Y and the dichotomization \(X_{d} (h)\):

$${\text{Corr(}}X_{d} (h),{\text{Y)}} = \frac{{(\beta_{1} \lambda_{3} + \beta_{2} \lambda_{4} ) + (\beta_{1} \mu_{1} \lambda_{5} + \beta_{2} \mu_{2} \lambda_{6} ) - \mu_{y} (\lambda_{5} + \lambda_{6} )}}{{\sqrt {1 - (\lambda_{5} + \lambda_{6} )^{2} } \cdot\sqrt {\lambda_{1}^{\prime } + \lambda_{2}^{\prime } + { }\tau^{2} } }},$$
(4)

where \(\tau^{2} = \gamma \cdot \tau_{1}^{2} + (1 - \gamma )\tau_{2}^{2}\), \(\lambda^{\prime}_{1} = \gamma (1 - \gamma )(\beta_{1} \mu_{1} - \beta_{2} \mu_{2} )^{2}\), \(\lambda^{\prime}_{2} = \gamma \cdot \beta_{1}^{2} \sigma_{1}^{2} + (1 - \gamma )\beta_{2}^{2} \sigma_{2}^{2}\), and \(\mu_{y} = \gamma \beta_{1} \mu_{1} + (1 - \gamma )\beta_{2} \mu_{2}\) is the mean of Y.
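
Analogously, a direct R transcription of Eq. (4), using the quantities defined above:

```r
## Sketch: Corr(X_d(h), Y) from Theorem 2, transcribed from Eq. (4)
## (tau1 and tau2 are the noise standard deviations of the two components)
corr_xd_y <- function(h, gamma, mu1, mu2, sigma1, sigma2,
                      beta1, beta2, tau1, tau2) {
  z1 <- (h - mu1) / sigma1
  z2 <- (h - mu2) / sigma2
  l3 <- gamma * 2 * sigma1 * dnorm(z1)                          # lambda_3
  l4 <- (1 - gamma) * 2 * sigma2 * dnorm(z2)                    # lambda_4
  l5 <- gamma * (1 - 2 * pnorm(z1))                             # lambda_5
  l6 <- (1 - gamma) * (1 - 2 * pnorm(z2))                       # lambda_6
  tau2_mix <- gamma * tau1^2 + (1 - gamma) * tau2^2             # tau^2
  l1p <- gamma * (1 - gamma) * (beta1 * mu1 - beta2 * mu2)^2    # lambda_1'
  l2p <- gamma * beta1^2 * sigma1^2 +
         (1 - gamma) * beta2^2 * sigma2^2                       # lambda_2'
  mu_y <- gamma * beta1 * mu1 + (1 - gamma) * beta2 * mu2       # mean of Y
  num <- (beta1 * l3 + beta2 * l4) +
         (beta1 * mu1 * l5 + beta2 * mu2 * l6) - mu_y * (l5 + l6)
  num / (sqrt(1 - (l5 + l6)^2) * sqrt(l1p + l2p + tau2_mix))
}
```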

Note that in the case of a single Gaussian model, e.g., \(\mu_{1} = \mu_{2} = 0\) and \(\sigma_{1} = \sigma_{2} = 1\), Theorem 1 simplifies to the following well-known result:

$${\text{Corr(}}X_{d} (h),X{)} = \frac{\phi (h)}{{\sqrt {\Phi (h)(1 - \Phi (h))} }}.$$

Furthermore, when the split is at the mean, i.e., h = 0, this gives \(\sqrt {2/\pi } \approx 0.8\).
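
This special case can be verified numerically with the corr_xd_x() sketch above:

```r
## Check the single-Gaussian special case (mu1 = mu2 = 0, sigma1 = sigma2 = 1)
h <- 0.5
corr_xd_x(h, gamma = 0.5, mu1 = 0, mu2 = 0, sigma1 = 1, sigma2 = 1)
dnorm(h) / sqrt(pnorm(h) * (1 - pnorm(h)))   # same value
corr_xd_x(0, 0.5, 0, 0, 1, 1)                # split at the mean: sqrt(2/pi), about 0.798
```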

2.1 Diagnosing bimodality

Under a Gaussian mixture of \(N(\mu_{1} ,\sigma_{1}^{2} )\) and \(N(\mu_{2} ,\sigma_{2}^{2} )\), the rule of thumb for identifying bimodality is \(a = \frac{{ \left| {\mu_{1} - \mu_{2} } \right| }}{{\sigma_{1} + \sigma_{2} }} \ge 1\), as proposed by Cohen and Burke (1956) and Schilling et al. (2002). In practice, the \(\mu_{i}\)'s and \(\sigma_{i}^{2}\)'s are unknown before separation, and the bimodality coefficient, BC = (\({\text{Skewness}}^{2}\) + 1)/(excess Kurtosis + 3), is commonly used to detect a mixture distribution (Pfister et al. 2013). A sample with estimated BC greater than the benchmark value of 5/9 suggests a two-component mixture, where 5/9 is the expected BC of a uniform distribution. We consider this criterion too stringent for Gaussian mixtures. For example, the distribution of an equal mixture simulated from \(N( - 1.05,1)\) and \(N(1.05,1)\) is shown in Fig. 1. The histogram shows obvious twin peaks with bimodality \(a = 1.05\), yet BC = 0.41. We give the formula of BC for a Gaussian mixture of \(N(\mu_{1} ,\sigma_{1}^{2} )\) and \(N(\mu_{2} ,\sigma_{2}^{2} )\) in the “Appendix”, and suggest using the benchmark 0.4 instead of 0.556 (≈ 5/9), where 0.4 is the expected BC of the equal mixture of \(N( - 1, 1)\) and \(N(1, 1)\), with corresponding bimodality \(a = 1\).
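
A sample version of BC can be computed with simple moment estimators, as in the following sketch (the small-sample correction used by some implementations is omitted):

```r
## Sketch: sample bimodality coefficient BC = (skewness^2 + 1) / (excess kurtosis + 3)
bimodality_coef <- function(x) {
  z    <- (x - mean(x)) / sd(x)
  skew <- mean(z^3)                  # sample skewness
  exk  <- mean(z^4) - 3              # sample excess kurtosis
  (skew^2 + 1) / (exk + 3)
}

set.seed(1)
x <- c(rnorm(5e4, -1.05, 1), rnorm(5e4, 1.05, 1))  # equal mixture as in Fig. 1
bimodality_coef(x)                                 # about 0.41, below the 5/9 benchmark
```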

Fig. 1 Equal mixture simulated from \(N\left( { - 1.05,1} \right)\) and \(N\left( {1.05,1} \right)\) shows obvious twin peaks with bimodality \(a = 1.05\), but BC = 0.41

For illustration, we take the split at the inverse-standard-deviation-weighted (ISDW) mean as an example, i.e., \(h = \frac{{\frac{{\mu_{1} }}{{\sigma_{1} }} + \frac{{\mu_{2} }}{{\sigma_{2} }}}}{{\sigma_{1}^{ - 1} + \sigma_{2}^{ - 1} }} = \frac{{\left( {\sigma_{2} \mu_{1} + \sigma_{1} \mu_{2} } \right)}}{{\sigma_{1} + \sigma_{2} }}\), and show the relationships between \({\text{Corr(}}X_{d} (h),X{)}\), the bimodality index \(a = \frac{{\left| {\mu_{1} - \mu_{2} } \right|}}{{\sigma_{1} + \sigma_{2} }}\), and the mixture proportion γ in Fig. 2 (left). As bimodality increases, the point-biserial correlation increases. Furthermore, all the bimodal cases (a ≥ 1) have higher correlations than the horizontal line for the unimodal case (a = 0) at 0.798, except for highly unbalanced mixtures with γ < 0.2 or γ > 0.8. Figure 2 (right) shows the relationships between \({\text{Corr(}}X_{d} (h),X{)}\), the split point h, and γ under equal variances (\(\sigma_{1} = \sigma_{2} = \sigma\)), equal slopes (\(\beta_{1} = \beta_{2}\)), \(\mu_{1} = - \mu_{2}\), and \(a = 2\). We see that \({\text{Corr(}}X_{d} (h),X{)}\) peaks at h = 0 under various γ values, but for a highly unbalanced mixture with γ = 0.1, \({\text{Corr(}}X_{d} (h),X{)}\) can fall below the 0.8 benchmark.
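
Both the ISDW-mean cut and the cut maximizing \({\text{Corr(}}X_{d} (h),X{)}\) are easy to compute from the mixture parameters, for example with optimize() and the corr_xd_x() sketch above (illustrative values):

```r
## Sketch: ISDW-mean cut vs. the cut that maximizes Corr(X_d(h), X)
gamma <- 0.5; mu1 <- -2; mu2 <- 2; sigma1 <- 1; sigma2 <- 1   # bimodality a = 2
h_isdw <- (sigma2 * mu1 + sigma1 * mu2) / (sigma1 + sigma2)   # ISDW mean

h_opt <- optimize(function(h) corr_xd_x(h, gamma, mu1, mu2, sigma1, sigma2),
                  interval = c(mu1 - 3 * sigma1, mu2 + 3 * sigma2),
                  maximum = TRUE)$maximum

c(ISDW = h_isdw, Optimal_x = h_opt)   # both near 0 in this symmetric example
```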

Fig. 2 Point-biserial correlation of (\(X_{d} ,X\)) versus the mixture proportion γ and the bimodality index \(a = \frac{{\left| {\mu_{1} - \mu_{2} } \right|}}{{\sigma_{1} + \sigma_{2} }}\) when split at the inverse-standard-deviation-weighted (ISDW) mean (left), and \({\text{Corr}}\left( {X_{d} \left( h \right),X} \right)\) versus the split point h and γ (right)

Figure 3 shows an example of non-parallel and heterogeneous mixtures, with slopes 0 and 2 and variances 2 and 1 for groups 1 and 2, respectively, and proportion γ = 0.5. The corresponding IRR = 1.07 > 1 indicates that, for a non-parallel Gaussian mixture, a proper dichotomization may yield an even higher point-biserial correlation with Y than the Pearson correlation of the continuous variables.

Fig. 3 Illustration of \({\text{Corr}}\left( {X_{d} \left( h \right),X} \right)\), split point h, and γ, when \(a = 2\) (assuming \(\sigma_{1} = \sigma_{2} = \sigma\) and \(\mu_{1} = - \mu_{2} = 2\sigma\)). The horizontal line near 0.8 is used as the benchmark

If \(X_{1}\) and \(X_{2}\) have parallel effects on Y, i.e., \(\beta_{1} = \beta_{2} = \beta\), then we have the following corollary.

Corollary 1

If \(\beta_{1} = \beta_{2} = \beta\):

$${\text{Corr(}}X_{d} (h),{\text{Y)}} = {\text{Corr(}}X_{d} (h),{\text{X)}} \cdot \frac{{\beta \sqrt {\lambda_{1} + \lambda_{2} } }}{{\sqrt {\beta^{2} (\lambda_{1} + \lambda_{2} ) + \tau^{2} } }}\quad {\text{and}}\quad {\text{IRR}}(h) = \frac{{{\text{Corr(}}X_{d} (h),{\text{Y)}}}}{{{\text{Corr}}(X,{\text{Y}})}} = {\text{Corr(}}X_{d} (h),{\text{X)}}.$$

Since \(\lambda_{1}\) and \(\lambda_{2}\) do not depend on the choice of h in this scenario, i.e., \(Y = \beta X + \varepsilon\), the optimal h that maximizes \({\text{Corr(}}X_{d} (h),X{)}\) also maximizes \({\text{Corr(}}X_{d} (h),{\text{Y)}}\) and the IRR. This result coincides with that of Demirtaş and Hedeker (2016).

2.2 Comparing various splitting approaches by simulation

We used a simulation study to compare the two optimal splits proposed in this paper with the following three commonly used cut-points. The median was calculated directly from the sample of X. The estimated proportion in the “\(\hat{\gamma }\) percentile” and the estimates of (\(\mu_{1} ,\mu_{2} ,\sigma_{1} ,\sigma_{2}\)) in the ISDW mean were obtained by applying mclust(), a Gaussian mixture model classifier in R, to X. Our proposed splits h for Optimal_x and Optimal_xy were obtained by maximizing \({\text{Corr(}}X_{d} (h),X{)}\) and \({\text{Corr(}}X_{d} (h),Y{)}\), respectively. We assessed performance based on \({\text{Corr(}}X_{d} (h),{{X)}}\), \({\text{Corr(}}X_{d} (h),{{Y)}}\), and IRR.
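
The released dicho_cor_x() and dicho_cor_xy() functions are available in the linked repository; purely as an illustration, the five cut-points could be obtained along the following lines, reusing the simulated dat and the corr_xd_x() sketch above. The mixture fit assumes the Mclust() interface of the mclust package, and Optimal_xy is shown in a simple empirical form that maximizes the sample correlation rather than the plug-in formula of Theorem 2:

```r
## Sketch of the five cut-points compared in the simulation (illustrative only)
library(mclust)

fit <- Mclust(dat$x, G = 2, modelNames = "V")    # two-component Gaussian mixture fit
p   <- fit$parameters$pro[1]                     # estimated gamma
mu  <- fit$parameters$mean                       # estimated mu1, mu2
sg  <- sqrt(fit$parameters$variance$sigmasq)     # estimated sigma1, sigma2
rng <- quantile(dat$x, c(0.05, 0.95))            # search range for the optimal cuts

h_median <- median(dat$x)                                          # 1: median
h_gamma  <- quantile(dat$x, probs = p)                             # 2: gamma-hat percentile
h_isdw   <- (sg[2] * mu[1] + sg[1] * mu[2]) / (sg[1] + sg[2])      # 3: ISDW mean
h_opt_x  <- optimize(function(h) corr_xd_x(h, p, mu[1], mu[2], sg[1], sg[2]),
                     interval = rng, maximum = TRUE)$maximum       # 4: Optimal_x
h_opt_xy <- optimize(function(h) cor(ifelse(dat$x >= h, 1, -1), dat$y),
                     interval = rng, maximum = TRUE)$maximum       # 5: Optimal_xy (empirical)
```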

The first part of the simulation was based on the Gaussian mixture settings (1) and (2) with eight parameter settings of (\(\beta_{1} ,\beta_{2} ,\mu_{1} ,\mu_{2} ,\sigma_{1} ,\sigma_{2}\)) and \(\tau^{2} = 1\). The first four settings assume a parallel x–y relation between the two mixture components, with \((\beta_{1} ,\beta_{2} ) = (1,1)\), while the last four assume a non-parallel relation, with \((\beta_{1} ,\beta_{2} ) = (0,2)\), i.e., one component has dose–response slope \(\beta_{1} = 0\) and the other has slope \(\beta_{2} = 2\). Each setting was replicated 200 times. The bimodality index and BC of each Gaussian mixture setting, and the point-biserial correlation coefficients for the five different dichotomizations, are shown in Supplementary Table S1.

2.3 Robustness against non-normality

To test the methods’ robustness against non-normality, we replaced \(N(\mu_{1} ,\sigma_{1}^{2} )\) and \(N(\mu_{2} ,\sigma_{2}^{2} )\) in the Gaussian mixture (1) with a mixture of two shifted exponential random variables, \(X_{1} + (\mu_{1} - \sigma_{1}\)) and \(X_{2} + (\mu_{2} - \sigma_{2}\)), where \(X_{i} \sim Exp(\sigma_{i} )\) with mean \(\sigma_{i}\). Under this setting, the means and variances of the two components, \((\mu_{1} ,\sigma_{1}^{2} )\) and \((\mu_{2} ,\sigma_{2}^{2} )\), as well as the bimodality index \(\frac{{\left| {\mu_{1} - \mu_{2} } \right|}}{{\sigma_{1} + \sigma_{2} }}\), are the same as those of the Gaussian mixtures in Fig. 4. The simulation results are shown in Fig. 5, where each parameter setting is comparable with its counterpart in Fig. 4.
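
A sketch of this robustness setting, interpreting Exp(\(\sigma_i\)) as an exponential with mean \(\sigma_i\) (rate \(1/\sigma_i\)) so that the component means and variances match the Gaussian case:

```r
## Sketch: shifted-exponential mixture used in the robustness check
simulate_exp_mixture <- function(n, gamma, mu1, mu2, sigma1, sigma2) {
  w  <- rbinom(n, 1, gamma)
  x1 <- rexp(n, rate = 1 / sigma1) + (mu1 - sigma1)   # mean mu1, variance sigma1^2
  x2 <- rexp(n, rate = 1 / sigma2) + (mu2 - sigma2)   # mean mu2, variance sigma2^2
  w * x1 + (1 - w) * x2
}

set.seed(7)
x <- simulate_exp_mixture(1e5, gamma = 0.5, mu1 = -2, mu2 = 2, sigma1 = 1, sigma2 = 1)
c(mean(x), var(x))   # close to 0 and 5, the moments of the matching Gaussian mixture
```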

Fig. 4 Comparison of point-biserial correlation Corr(\(X_{d} ,X)\) and IRR (Corr(\(X_{d} ,Y)/{\text{Corr}}\left( {X,Y} \right)\)) for 5 dichotomy methods (1: Median, 2: Percentile, 3: ISDW*, 4: Optimal_x and 5: Optimal_xy) by different bimodalities under parallel and non-parallel Gaussian mixture settings. *ISDW mean: inverse-standard-deviation-weighted mean, \(h = \frac{{\left( {\sigma_{2} \mu_{1} + \sigma_{1} \mu_{2} } \right)}}{{\sigma_{1} + \sigma_{2} }}\)

Fig. 5 Comparison of point-biserial correlation Corr(\(X_{d} ,X)\) and IRR for 5 dichotomy methods (1: Median, 2: Percentile, 3: ISDW, 4: Optimal_x and 5: Optimal_xy) by different bimodalities under parallel and non-parallel exponential mixture settings

2.4 Results of simulation

Figure 4 shows the results from eight parameter settings of (\(\beta_{1} ,\beta_{2} ,\mu_{1} ,\mu_{2} ,\sigma_{1} ,\sigma_{2}\)), with three resulting bimodality indices, 1, 4/3, and 2, under the Gaussian mixture. The split \(X_{d}\) of Optimal_x has the highest correlation with X and therefore preserves the most original information. However, since its selection does not depend on Y, it may not have the highest correlation with Y. By contrast, \(X_{d}\) of Optimal_xy has the highest correlation with Y; however, the selection of this split is based on the assumption of a random sampling scheme on (X, Y), and the split may vary with the sampling scheme of Y, for example, under stratified sampling. Scenarios with larger bimodality achieved higher IRR after dichotomization.

For all the parallel settings in Fig. 4, IRR < 1, and higher bimodality resulted in higher correlation and IRR. However, in some non-parallel settings, IRR > 1, and the two optimal splits performed much better than the other splits, particularly when compared with the results in the parallel settings. This may indicate that, if the explanatory variable X is a mixture of two normally distributed groups that each have a different effect on the outcome Y, then a proper dichotomization of X may be suitable, in the sense that little information in the continuous X is lost and the point-biserial correlation may even exceed the original Pearson correlation.

Figure 5 shows the results for similar parameter settings, except using a mixture of exponential distributions. Compared with their counterparts in Fig. 4, most of the IRRs drop. More details can be found in Supplementary Tables S1–S2, where, on average, the IRR for dichotomization by the \(\hat{\gamma }\) percentile and the ISDW mean dropped 18.3% and 13.7%, respectively, while that by Optimal_x and Optimal_xy dropped only 4.9% and 6.0%, respectively. This suggests that, although deviation from the Gaussian mixture assumption may reduce information retention, the two proposed dichotomizations lose the least.

2.5 Example with real data sets

We used four real-world data sets to illustrate the performance of various dichotomization strategies, as shown in Table 1 and Fig. 6. Optimal_x and Optimal_xy are the cut-points obtained from dicho_cor_x() and dicho_cor_xy() by maximizing \({\text{Corr(}}X_{d} (h),X{)}\) and \({\text{Corr(}}X_{d} (h),Y{)}\) in Theorems 1 and 2, respectively.

Table 1 Results after various dichotomization strategies for three real-world data sets
Fig. 6 Distribution of X and the corresponding optimal cut-point for the four data sets

2.5.1 Data 1

Clinical trials have demonstrated an association between HbA1c and type 2 diabetes; thus, HbA1c is an important reference for diabetes diagnosis. HbA1c is known to reflect blood glucose levels over the past 2–3 months (Zhou et al. 2013). In the data from Lian et al. (2023), 268 ischemic stroke patients with measured HbA1c and before-meal blood glucose (BBG) were recruited, and 83 of them were diagnosed with diabetes. Using the procedure dicho_cor_x(), a cut-point at a plasma glucose concentration of 140 mg/dL was selected, which happens to coincide with the diabetes signal proposed by the American Diabetes Association (2009). This suggested that the 140 mg/dL cut-point is suitable not only for the general population but also for this specific group of ischemic stroke patients.

The Pearson correlation coefficient between HbA1c and glucose was 0.69, and the point-biserial correlation coefficient between HbA1c and the dichotomized glucose indicator (dGlucose, dichotomized at Glucose = 140) was 0.67, indicating that about 95% of the information was retained after dichotomization. Table 2 shows the performance of diabetes status prediction by logistic regression using four models: Glucose (continuous) only, dGlucose (dichotomized at Glucose = 140) only, Glucose + HbA1c, and dGlucose + HbA1c. The resulting sensitivity and specificity for dGlucose are 72.3% and 64.9%, respectively, with a Youden index (sensitivity + specificity − 1) similar to the maximum Youden index of continuous Glucose. With the additional predictor HbA1c, dGlucose also achieved a Youden index similar to that of the continuous model.
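
A sketch of this comparison, assuming a hypothetical data frame stroke with columns DM (0/1 diabetes status), Glucose, and HbA1c; for simplicity, the Youden index is evaluated at a fixed 0.5 probability cut-off rather than maximized over cut-offs:

```r
## Sketch: logistic models with continuous vs. dichotomized glucose (hypothetical data)
stroke$dGlucose <- as.numeric(stroke$Glucose >= 140)     # dichotomized at 140 mg/dL

m_cont <- glm(DM ~ Glucose  + HbA1c, family = binomial, data = stroke)
m_dich <- glm(DM ~ dGlucose + HbA1c, family = binomial, data = stroke)

youden <- function(model, outcome, cut = 0.5) {
  pred <- as.numeric(fitted(model) >= cut)               # predicted diabetes status
  sens <- mean(pred[outcome == 1] == 1)                  # sensitivity
  spec <- mean(pred[outcome == 0] == 0)                  # specificity
  sens + spec - 1                                        # Youden index
}
c(continuous = youden(m_cont, stroke$DM), dichotomized = youden(m_dich, stroke$DM))
```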

Table 2 Classification results of diabetes (DM) by logistic regression using four models: Glucose (continuous) only, dGlucose (dichotomized at Glucose = 140) only, Glucose + HbA1c, and dGlucose + HbA1c

2.5.2 Data 2

Earthquake data were collected by the Seismology Center of Taiwan during the first 48 h after an earthquake of local magnitude (ML) 7.3 struck Taiwan on September 21, 1999. The data consist of 47 records of magnitude (X), duration (Y), depth, location, etc., for the main shock and early aftershocks. Figure 6 (Data 2) shows that X is bimodal with bimodality 2.26 > 1. The median split \(x_{d}\) explains only 47.3% (the square of 0.688) of the variation in \(x\) and retains only 56.1% of the correlation with duration. The other splits preserve more information from the continuous variable (70–72%) and achieve higher IRR (0.86–0.875). A link to the dataset is provided with the software at the end of this paper.

2.5.3 Data 3

Height (X) and weight (Y) data of 127,682 individuals (82,035 women and 45,647 men) were collected by the Taiwan Biobank. Figure 6 (Data 3) shows that the bimodality of height is not obvious, with a = 0.865; therefore, in this case, there is less difference between the splits than in Data 1. Nonetheless, the two optimal splits preserve more information from the continuous height and have higher IRR with weight than the other dichotomizations.

2.5.4 Data 4

Insurance data contributed by Bob Wakefield were downloaded from data.world. We analyzed the claim amounts (X) of 1,338 insured people. The histogram of X in Fig. 6 (Data 4) is skewed to the right, similar to Data 1, with moderate bimodality a = 1.22. As in the previous examples, Optimal_x preserves more information of X (0.87² = 0.757) than splits such as the median (0.689² = 0.475) and the ISDW mean (0.79² = 0.624).

In summary, the studies based on simulations and real-world data indicate that our proposed approaches preserve the most information of X and usually result in a higher retention rate for the correlation with the outcome Y. All four data sets are available via the link provided with the software (see the user's manual below).

2.6 Software

The R code for the functions dicho_cor_x() and dicho_cor_xy(), which calculate the cut-point h that yields the maximum \({\text{Corr(}}X_{d} (h),{\text{X)}}\) and \({\text{Corr(}}X_{d} (h),{\text{Y)}}\), respectively, can be downloaded from the following GitHub link, and the flowcharts of the algorithms are shown in Fig. 7. The functions rely on mclust(), optimize(), and piecewise(), which the user needs to install first; the code is available at: https://github.com/iblian/dicho_cor

Fig. 7 The flowcharts of the algorithms

3 Conclusion

When dichotomizing data with a bimodal distribution, this paper assumed that an artificial division into the two groups with the largest standardized difference in means may be more interpretable when other domain knowledge for dichotomization is lacking. We thus proposed a dichotomization procedure that maximizes the standardized mean difference between groups while also minimizing the information loss, under the assumption of a bimodal Gaussian mixture. Accordingly, the optima may not hold if the assumption is violated, if the data are not from a random sample, or when other domain knowledge is available.

When deviating from the Gaussian mixture assumption (to an exponential mixture in our simulation), the IRRs for the cut-points from Optimal_x and Optimal_xy dropped only 4.9% and 6.0%, respectively, while the IRRs for the other cut-points (\(\hat{\gamma }\) percentile and ISDW mean) dropped, on average, 18.3% and 13.7%, respectively. Our simulation also shows that, even with a non-Gaussian mixture (two-component exponential in our trials), the \({\text{Corr(}}X_{d} ,X{)}\) obtained from Optimal_x is > 0.8 when BC > 0.46 and > 0.9 when BC > 0.5. Therefore, we propose the following dichotomization procedure for \(X\):

Step 1: Calculate the bimodality coefficient (BC) of \(X\). If BC > 0.4, then proceed.

Step 2: Apply dicho_cor_x() and dicho_cor_xy() to calculate the cut-points h that yield the maximum \({\text{Corr(}}X_{d} (h),{\text{X)}}\) and \({\text{Corr(}}X_{d} (h),{\text{Y)}}\), respectively.
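
Putting the two steps together, a possible wrapper is sketched below; the calls to dicho_cor_x(x) and dicho_cor_xy(x, y) are hypothetical, and the actual interfaces are documented in the repository:

```r
## Sketch of the two-step procedure; dicho_cor_x()/dicho_cor_xy() signatures are assumed
dichotomize_if_bimodal <- function(x, y, bc_threshold = 0.4) {
  bc <- bimodality_coef(x)                  # Step 1: bimodality coefficient of X
  if (bc <= bc_threshold) {
    return(list(bc = bc, cut = NULL))       # no evidence of a two-component mixture
  }
  list(bc   = bc,
       h_x  = dicho_cor_x(x),               # Step 2: cut maximizing Corr(X_d(h), X)
       h_xy = dicho_cor_xy(x, y))           #         cut maximizing Corr(X_d(h), Y)
}
```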

This paper established the connection between the Pearson correlation and the point-biserial correlation for underlying Gaussian mixture distributions with different levels of bimodality, and proposed a dichotomization method that maximizes the point-biserial correlation, which is equivalent to minimizing the information loss. Although minimizing information loss is not the main objective of dichotomization, it is a desirable property for an interpretable dichotomization.