Decline of Pearson’s r with categorization of variables: a large-scale simulation

It is often said that correlation coefficients computed from categorical variables are biased and thus should not be used. However, practitioners often ignore this longstanding caveat from statisticians. Although some studies have examined the bias, the true extent is still unknown. This study is an extensive attempt to determine the range and degree of the biases. In our simulation, continuous variables were categorized according to various thresholds and used to compute Pearson’s r. The results indicated that there were more serious biases than highlighted in previous studies. The results also revealed that increasing data size did not reduce the biases. Possible ways to cope with the biases are discussed.


Introduction
It is a common practice in social sciences to compute Pearson's correlation coefficient r from ordered categories by assigning integers to the categories, as in a Likert scale. In fact, Karl Pearson, who defined the coefficient, computed r from categorized variables but he noticed that r is biased when the number of categories is small and, therefore, "broad". He proposed some remedies to address this issue (Pearson 1913). Ritchie-Scott (1918) then proposed the polychoric correlation coefficient and Pearson and Pearson (1922) improved it. However, an executable version of the polychoric correlation coefficient took a long time to appear (Olsson 1979). Despite the longstanding caveat by psychometricians, very few people attempt to use the polychoric correlation coefficient. Evidently, researchers do understand the importance of the categorization bias. In marketing science, simulations have been conducted to examine the extent of biases (Morrison 1972;Martin 1973Martin , 1978. According to Martin (1978, p. 307), "the amount of lost information is substantial". In sociology, Bollen and Barb (1981) also conducted simulation studies contrasting the correlation between two original continuous variables and their categorized versions. They concluded that the differences are generally small, but grow when there is high correlation between original continuous variables and the number of categories is small.
These studies seem to have correctly described the global tendency of the biases but have failed to incorporate two important points. First, few studies considered the situation in which the number of categories is different. Shiina et al. (2012) proved that when different numbers of ordered categories are used, Pearson's r cannot be − 1 or 1 when: (1) variable X has m (≥ 2) ordered categories and variable Y has n (≥ 2) ordered categories, (2) n ≠ m, and (3) these categories are used at least once. A simpler new proof is as follows. If all the data are on an oblique line, then r = 1 and vice versa. If all the data are on the line, then the number of orthogonal images of the data on X and Y axes should be identical. Therefore, r = 1 implies that the number of such images should be identical. From the contrapositive of the proposition, we can conclude that if the numbers of orthogonal images on both axes (the number of categories) are not the same, r cannot be 1. In view of this proof, it is imperative to pay close attention to the situation in which the number of categories is different.
Second, past studies have not extensively examined the effect of the arrangement of thresholds at which original continuous variables are converted into categorized (or integer-valued) variables. This is important because a disorderly arrangement of thresholds can easily destroy the structure of the original continuous distribution. This paper examines the effects of conversion of continuous variables into categorized ones on the decline of the correlation coefficient, using different numbers of categories and various thresholds. We will first demonstrate how categorized variables with different numbers of categories and disorderly thresholds yield large biases of r. Then, we will run a large-scale simulation and report the full extent of biases of r.

Assumptions on the data generating process
Let x and y be two continuous latent variables obeying a bivariate normal distribution (BND): It is assumed that the original variables are categorized and yield manifest variables X and Y. We should consider the number of categories (m for X and n for Y), as well as the arrangement of thresholds, because how we divide the original continuous latent variables into categories will strongly affect the extent of biases. We can set x = y = 0 and x = y = 1 without loss of generality and can define thresholds for x and y as The true probability ij of each cell in the contingency table (correlation table) corresponds to the rectangular region [ i−1 , 1 ]×[ j−1 , j ] in the x-y space and is given by Figure 1 illustrates the original BND and an example of true probabilities of each cell in the contingency table.

Expected r when computing from categorized variables
We can compute the expected values of r with categorized variables using true probabilities of each cell. Expected r is given by .  Figure 2 shows two general tendencies. One is that a smaller number of categories increase biases of r and the other is that biases of r become greater as ρ increases.
The computation of expected r in Fig. 2 used equalized categories and we did not fully consider the location of thresholds. While equalized categories are partitioned by "well-organized" thresholds, there is some type of the arrangement of thresholds that destroys the structure of original continuous distribution. Such "ill-organized" thresholds will induce more serious biases and Fig. 3 shows an example that categorizing continuous variables with ill-organized thresholds ( 1 = −2, 2 = 0; 1 = −1, 2 = 1.5, 3 = 2 ) generates a severe decline of the correlation coefficient compared to well-organized thresholds shown in Fig. 1.
By "well-organized thresholds," we mean a set of thresholds that keeps the properties of the original BND, which includes symmetric and single-peaked shape, no void region in the center of the distribution, and no-overconcentration. By "ill-organized thresholds," we mean the opposite, that is, a set of thresholds that yields asymmetry, multiple-peaks, voids, concentrations, monotone decreasing, or increasing. Because these properties are qualitative and vague and thus are difficult to represent numerically, we present some examples for further understanding. Figure 4 illustrates how some sets of ill-organized thresholds destroy the original structure of BND, which results in the decline of r. It is noted that asymmetry of the distributions of two categorized variables causes massive decline of r as in the left panel of Fig. 4 and overconcentration into one cell also decreases r considerably as in Fig. 4 right panel.

Simulation
In our simulation, four factors were manipulated: ρ of BND, ϕ (x, y | 0, 0, 1, 1, ρ), data size, the number of categories, and the thresholds. We generated a pair of random numbers (x and y) from BND. We then categorized the two continuous variables into two integer-valued (Likert) variables X and Y and computed the correlation between the categorized variables. Table 1 shows the factors and the levels in our simulation.
Because one of the aims of our simulation is to examine how various threshold settings affect the bias of correlation, we set up two threshold settings. One setting used thresholds from continuous uniform distribution. This was because uniform distribution is ordinarily used when no reasonable prior information is available. More precisely, random numbers were generated from continuous uniform distribution U(− 1, 1) and then they were arranged in ascending order. We call this "the uniform setting" for short. According to a standard result from order statistics (David and Nagaraja 2003), kth threshold is beta-distributed with where l denotes the number of thresholds. Therefore, the locations of thresholds tend to be systematic while the thresholds in middle position will have a large variance.
The other setting used thresholds with small "noise," which is a random threshold version of the situation in the right panel of Fig. 2, which will be more familiar to psychometricians (the Law of Categorical Judgment, Torgerson 1958). We call this "the equal setting" for short. To avoid possible crossovers of the fluctuated thresholds, truncated normal (TN) was used. The general form of TN probability density is given by where (⋅) is the standard normal distribution, (⋅) is the cumulative standard normal distribution, and [a, b] is the domain. In the current situation, we set = −1 + 2 m i using (5), and 2 = (0.05) 2 , a = − 1 m , and b = + 1 m to represent the probability distribution of threshold i . Therefore, we have which means that the mean, mode, median of the probabilistic variable i are the same due to the positioning of , a, b such that b − = − a = 1∕m . The restriction on thresholds: is always conserved in this sampling scheme. The probabilistic variable j was determined in a similar manner.

Thresholds setting
The uniform setting 10,000 pairs of random threshold sets were determined as follows: (m − 1) and (n − 1) random numbers were generated from continuous uniform distribution U(− 1, 1). They were then arranged in ascending order.
Pairs of these arranged sets were used for categorizing x and y into X

and Y
The equal setting 10,000 pairs of random threshold sets were determined as follows: First, (m − 1) and (n − 1) points are defined as in (5)  All the sets of thresholds generated from equal setting can be generated (with different probability) from uniform setting but not vice versa.
In both threshold settings, we used a range [− 1, 1] for generating a set of thresholds for the following reasons. First, since overconcentration causes considerable decline of r as in Fig. 4, it is fruitless to set too large of a range, [− 20, 20] for example, that likely induces an overconcentration and voids in both sampling schema. Moreover, it might produce zero variance, which will induce division by zero in (4). Second, this study is a first attempt to examine the threshold effect on the decline of r; therefore, it is somewhat arbitrary because we should start from somewhere.
In each combination of four factors (ρ, data size, pairs of the number of categories, threshold settings), we computed Pearson's correlation coefficient 1000 times. In this way, we utilized a total number of 16.38 billion (= 13 ρs × 3 data size × 21 pairs of the number of categories × 2 threshold settings × 10,000 threshold sets × 1000 times) correlation coefficients between two categorized variables.

Results
The average values of r between categorized variables are depicted in Fig. 5. Regardless of data size, category size, and threshold location, average value of r showed robust underestimation of ρ, but the pushdown bias decreased as the number of categories increased. Not surprisingly, the extent of bias differed between two thresholds settings.
Comparing two threshold settings, the uniform setting caused more serious decline of r. For example, in the case where m = 3, n = 4, ρ = 0.9, and data size is 1024, the average value of r in the uniform setting was 0.726 while the value was 0.818 in the equal setting.
Compared with expected values of r with well-organized thresholds in the right panel of Fig. 2, while no marked discrepancies were observed in the equal setting except for special cases where ρ = 1.0, there were substantial declines The cause of greater bias in the uniform setting could be that the simulation results include both well-organized and ill-organized thresholds. The uniform setting allows a set of thresholds to be ill-organized. For example, when the distance of thresholds is very close, a resulting contingency table tends to include an empty or almost empty category. In addition, when all the values of thresholds approach to upper or lower limits, a resulting table tends to be asymmetric. It is reasonable that such transformation of the original distribution causes considerable decline of r as indicated in Fig. 4, though the uniform setting also allows well-organized thresholds. On the other hand, such destructive transformation of the distribution is not possible in the equal setting. Therefore, it is plausible that the difference between two settings is derived from whether the setting tends to allow ill-organized thresholds. Figure 6 shows the effect of data size on variations (sampling distribution) of r where m = 6 and n = 7 in the uniform setting. It is revealed that larger data size decreases variations of r but does not shift the central position of distribution. This means that increasing data size does not reduce systematic biases of r, but it only accurately estimates biased r. This demonstrates a simple fact that, because a categorization is a non-linear transformation, as soon as we transform an original 1 3 continuous distribution into a categorized distribution, we have two different distributions and parameters estimated from different distributions are not generally the same.
It is very difficult to know the true locations of thresholds in real research situations. At the same time, it seems very reasonable to postulate that the locations of thresholds are different from person to person or from situation to situation. Therefore, an implication of the simulation results is that Pearson's correlation coefficient between categorized variables will decrease more if we consider a variety of data acquisition procedures and variety of threshold locations.

Conclusion
This study ran a large-scale simulation regarding biases of r when using categorized variables, carefully manipulating threshold locations. The results have shown that more serious biases of r occurred when thresholds are ill-organized. The findings suggest that previous simulation studies may have underestimated biases of r, and users of Likert-scales in social science should take the biases caused by categorized variables more seriously. Otherwise, biased values of r would result in incorrect interpretations of obtained data.
One of the possible ways to cope with the biases is the use of polychoric correlation. Estimation procedures of polychoric correlation were proposed by Olsson (1979) using maximum likelihood procedures and by Shiina et al. (2018) using the EM algorithm, although the use of polychoric correlation is not common in psychology.
There may be some limitations in this study. First, we have paid attention only to Pearson's r, not to other kinds of correlations (such as polychoric correlation or Spearman's rank correlation). Therefore, further studies are needed to examine the extent of the biases of different types of correlations. Second, our simulation has not completely examined the possible arrangement of thresholds. Although we set upper and lower limits [− 1, 1] for generating a set of thresholds, other upper and lower limits should also be considered. Such considerations will provide insights into the nature of the bias.
Funding This work was supported by JSPS KAKENHI Grant Numbers: 16H02050 and 18K03048.

Compliance with ethical standards
Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creat iveco mmons .org/licen ses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.