Decline of Pearson’s r with categorization of variables: a large-scale simulation
Abstract
It is often said that correlation coefficients computed from categorical variables are biased and thus should not be used. However, practitioners often ignore this longstanding caveat from statisticians. Although some studies have examined the bias, the true extent is still unknown. This study is an extensive attempt to determine the range and degree of the biases. In our simulation, continuous variables were categorized according to various thresholds and used to compute Pearson’s r. The results indicated that there were more serious biases than highlighted in previous studies. The results also revealed that increasing data size did not reduce the biases. Possible ways to cope with the biases are discussed.
Keywords
Correlation coefficient, Categorization bias, Number of categories, Likert scale
1 Introduction
It is common practice in the social sciences to compute Pearson’s correlation coefficient r from ordered categories by assigning integers to the categories, as in a Likert scale. Karl Pearson, who defined the coefficient, himself computed r from categorized variables, but he noticed that r is biased when the categories are few and therefore “broad”. He proposed some remedies to address this issue (Pearson 1913). Ritchie-Scott (1918) then proposed the polychoric correlation coefficient, and Pearson and Pearson (1922) improved it. However, an executable version of the polychoric correlation coefficient took a long time to appear (Olsson 1979). Despite this longstanding caveat from psychometricians, very few researchers use the polychoric correlation coefficient.
Evidently, researchers do understand the importance of the categorization bias. In marketing science, simulations have been conducted to examine the extent of biases (Morrison 1972; Martin 1973, 1978). According to Martin (1978, p. 307), “the amount of lost information is substantial”. In sociology, Bollen and Barb (1981) also conducted simulation studies contrasting the correlation between two original continuous variables and their categorized versions. They concluded that the differences are generally small, but grow when there is high correlation between original continuous variables and the number of categories is small.
These studies seem to have correctly described the global tendency of the biases but have failed to incorporate two important points. First, few studies considered the situation in which the number of categories is different. Shiina et al. (2012) proved that when different numbers of ordered categories are used, Pearson’s r cannot be − 1 or 1 when: (1) variable X has m (≥ 2) ordered categories and variable Y has n (≥ 2) ordered categories, (2) n ≠ m, and (3) these categories are used at least once. A simpler new proof is as follows. If all the data are on an oblique line, then r = 1 and vice versa. If all the data are on the line, then the number of orthogonal images of the data on X and Y axes should be identical. Therefore, r = 1 implies that the number of such images should be identical. From the contrapositive of the proposition, we can conclude that if the numbers of orthogonal images on both axes (the number of categories) are not the same, r cannot be 1. In view of this proof, it is imperative to pay close attention to the situation in which the number of categories is different.
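The claim that r cannot reach ± 1 when m ≠ n can be checked by brute force on a very small case. The sketch below is illustrative only: the helper names `pearson_r` and `max_abs_r` are hypothetical, integer scores 1..m and 1..n are assumed, and the data size is kept tiny so that every possible data set can be enumerated.

```python
from itertools import product
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences with non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def max_abs_r(m, n, size):
    """Largest |r| over all data sets of `size` points on the m x n grid
    in which every category of X and every category of Y is used."""
    cells = list(product(range(1, m + 1), range(1, n + 1)))
    best = 0.0
    for data in product(cells, repeat=size):
        xs = [c[0] for c in data]
        ys = [c[1] for c in data]
        # Condition (3): each category must appear at least once.
        if len(set(xs)) == m and len(set(ys)) == n:
            best = max(best, abs(pearson_r(xs, ys)))
    return best


if __name__ == "__main__":
    print("m=2, n=3:", max_abs_r(2, 3, 4))  # stays below 1
    print("m=3, n=3:", max_abs_r(3, 3, 3))  # reaches 1 on the diagonal
```

An exhaustive check of this kind only covers the chosen grid and data size; the general statement rests on the proof above.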
Second, past studies have not extensively examined the effect of the arrangement of thresholds at which original continuous variables are converted into categorized (or integer-valued) variables. This is important because a disorderly arrangement of thresholds can easily destroy the structure of the original continuous distribution.
This paper examines the effects of conversion of continuous variables into categorized ones on the decline of the correlation coefficient, using different numbers of categories and various thresholds. We will first demonstrate how categorized variables with different numbers of categories and disorderly thresholds yield large biases of r. Then, we will run a large-scale simulation and report the full extent of biases of r.
2 Assumptions on the data generating process
3 Expected r when computing from categorized variables
Figure 2 shows two general tendencies. One is that a smaller number of categories increases the bias of r, and the other is that the bias of r becomes greater as ρ increases.
By “well-organized thresholds,” we mean a set of thresholds that keeps the properties of the original BND: a symmetric and single-peaked shape, no void region in the center of the distribution, and no overconcentration. By “ill-organized thresholds,” we mean the opposite, that is, a set of thresholds that yields asymmetry, multiple peaks, voids, concentrations, or a monotonically decreasing or increasing shape.
4 Simulation
In our simulation, four factors were manipulated: ρ of the BND ϕ(x, y | 0, 0, 1, 1, ρ), data size, the number of categories, and the thresholds.
Factors and levels in the simulation
Factors | Levels | Note |
---|---|---|
ρ | 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.96, 0.98, 1.0 | |
Data size | 64, 256, 1024 | |
Number of categories | m2n2, m2n3, m2n4, m2n5, m2n6, m2n7, m3n3, m3n4, m3n5, m3n6, m3n7, m4n4, m4n5, m4n6, m4n7, m5n5, m5n6, m5n7, m6n6, m6n7, m7n7 | m for X, n for Y |
Thresholds setting (uniform) | The uniform setting: 10,000 pairs of random threshold sets were determined as follows. (m − 1) and (n − 1) random numbers were generated from the continuous uniform distribution U(− 1, 1). They were then arranged in ascending order. Pairs of these arranged sets were used for categorizing x and y into X and Y | |
Thresholds setting (equal) | The equal setting: 10,000 pairs of random threshold sets were determined as follows. First, (m − 1) and (n − 1) points were defined as in (5) or (6). Then, to each point, a random number from the doubly truncated normal distribution \(TN(0, 0.05^{2}, a, b)\), where [a, b] is the domain, \(a = -1/m\) and \(b = 1/m\) for X and \(a = -1/n\) and \(b = 1/n\) for Y, was added to represent a fluctuating threshold | |
The uniform and the equal settings are completely different sampling methods that generate different multidimensional distributions of thresholds. The uniform setting can produce more pathologically ill-organized thresholds than the equal setting. For example, a set of thresholds (\(- \infty , 0.81, 0.82, 0.83, \infty\)), when m = 4, is possible only in the uniform setting. All the sets of thresholds generated by the equal setting can also be generated (with different probability) by the uniform setting, but not vice versa.
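The two sampling schemes can be sketched as follows. This is a minimal illustration, not the authors' code. The function names are hypothetical. Because definitions (5) and (6) are not reproduced here, the base points of the equal setting are assumed to be the equally spaced values − 1 + 2j/k (j = 1, …, k − 1), which matches the well-organized thresholds quoted in the Results section. The truncated normal is drawn by simple rejection sampling.

```python
import random


def uniform_thresholds(k, rng):
    """Uniform setting: k - 1 draws from U(-1, 1), sorted ascending."""
    return sorted(rng.uniform(-1.0, 1.0) for _ in range(k - 1))


def truncated_normal(sd, lo, hi, rng):
    """Draw from N(0, sd^2) restricted to [lo, hi] by rejection sampling."""
    while True:
        z = rng.gauss(0.0, sd)
        if lo <= z <= hi:
            return z


def equal_thresholds(k, rng, sd=0.05):
    """Equal setting: equally spaced base points plus bounded jitter.

    ASSUMPTION: base points are -1 + 2j/k for j = 1..k-1; the source
    defines them in equations (5)/(6), which are not shown here.
    """
    half = 1.0 / k  # truncation bounds a = -1/k, b = 1/k
    return [
        -1.0 + 2.0 * j / k + truncated_normal(sd, -half, half, rng)
        for j in range(1, k)
    ]


if __name__ == "__main__":
    rng = random.Random(2019)
    print("uniform, m=4:", uniform_thresholds(4, rng))
    print("equal,   m=4:", equal_thresholds(4, rng))
```

Under this assumption the jitter never exceeds half the spacing between base points, so thresholds in the equal setting stay in ascending order, and for m = 4 the first threshold can never exceed − 0.25. A set such as (0.81, 0.82, 0.83) is therefore out of reach for the equal setting but not for the uniform one.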
In both threshold settings, we used a range [− 1, 1] for generating a set of thresholds for the following reasons. First, since overconcentration causes a considerable decline of r, as in Fig. 4, it is fruitless to set too large a range, [− 20, 20] for example, which would likely induce overconcentration and voids in both sampling schemes. Moreover, it might produce zero variance, which would induce division by zero in (4). Second, this study is a first attempt to examine the threshold effect on the decline of r; the choice is therefore somewhat arbitrary, because we must start from somewhere.
In each combination of the four factors (ρ, data size, pairs of the number of categories, threshold settings), we computed Pearson’s correlation coefficient 1000 times for each threshold set. In this way, we obtained a total of 16.38 billion (= 13 ρs × 3 data sizes × 21 pairs of the number of categories × 2 threshold settings × 10,000 threshold sets × 1000 times) correlation coefficients between two categorized variables.
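One cell of such a design can be sketched as below. The sketch assumes integer scores 1..m and 1..n, draws standard bivariate normal pairs via y = ρz₁ + √(1 − ρ²)z₂, and skips any replicate in which a categorized variable has zero variance. The function names and the replicate count in the usage lines are hypothetical and much smaller than in the study.

```python
import random
from bisect import bisect_right
from math import sqrt


def categorize(value, thresholds):
    """Map a continuous value to an integer category 1..len(thresholds)+1."""
    return bisect_right(thresholds, value) + 1


def pearson_r(xs, ys):
    """Pearson's r, or None when either variable has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def simulate_mean_r(rho, size, theta, tau, reps, rng):
    """Average r between categorized X and Y over `reps` samples of `size`."""
    total, used = 0.0, 0
    for _ in range(reps):
        xs, ys = [], []
        for _ in range(size):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            xs.append(categorize(z1, theta))
            ys.append(categorize(rho * z1 + sqrt(1.0 - rho * rho) * z2, tau))
        r = pearson_r(xs, ys)
        if r is not None:  # drop degenerate replicates
            total += r
            used += 1
    return total / used


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [-0.5, 0.0, 0.5]       # m = 4, thresholds quoted in the Results
    tau = [-0.6, -0.2, 0.2, 0.6]   # n = 5
    print(simulate_mean_r(0.8, 1024, theta, tau, reps=50, rng=rng))
```

With these fixed, well-organized thresholds the simulated mean should fall clearly below ρ = 0.8. Looping the same function over random threshold sets, data sizes, and values of ρ would give the full factorial design described above.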
5 Results
Comparing the two threshold settings, the uniform setting caused a more serious decline of r. For example, in the case where m = 3, n = 4, ρ = 0.9, and data size is 1024, the average value of r in the uniform setting was 0.726, while the value was 0.818 in the equal setting.
Compared with expected values of r with well-organized thresholds in the right panel of Fig. 2, while no marked discrepancies were observed in the equal setting except for special cases where ρ = 1.0, there were substantial declines in the average values of r in the uniform setting. For example, in the case where m = 4, n = 5, and ρ = 0.8, the expected r with well-organized thresholds (\(\theta_{1} = -\, 0.5, \theta_{2} = 0, \theta_{3} = 0.5 ;\tau_{1} = - \,0.6, \tau_{2} = -\, 0.2, \tau_{3} = 0.2, \tau_{4} = 0.6\)) is 0.726, while the average value of r in the uniform setting, where data size is 1024, is 0.660. To provide another example, in the case where m = 5, n = 7, and ρ = 0.8, the expected r with well-organized thresholds (\(\theta_{1} = - \,0.6, \theta_{2} = - \,0.2, \theta_{3} = 0.2, \theta_{4} = 0.6 ;\tau_{1} = - \,0.71, \tau_{2} = - \,0.43, \tau_{3} = - \,0.14, \tau_{4} = 0.14, \tau_{5} = 0.43, \tau_{6} = 0.71\)) is 0.688, whereas in the uniform setting, where data size is 1024, it is 0.635.
The cause of the greater bias in the uniform setting could be that the simulation results include both well-organized and ill-organized thresholds. The uniform setting allows a set of thresholds to be ill-organized. For example, when thresholds are very close together, the resulting contingency table tends to include an empty or almost empty category. In addition, when all the thresholds approach the upper or lower limit, the resulting table tends to be asymmetric. It is reasonable that such a transformation of the original distribution causes a considerable decline of r, as indicated in Fig. 4, though the uniform setting also allows well-organized thresholds. On the other hand, such a destructive transformation of the distribution is not possible in the equal setting. Therefore, it is plausible that the difference between the two settings derives from whether the setting tends to allow ill-organized thresholds.
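The effect of closely spaced thresholds on the marginal distribution can be seen directly. The following sketch compares the share of observations per category for a well-organized set and for the ill-organized set (0.81, 0.82, 0.83) mentioned earlier, using a single standard normal variable. The helper name `category_shares` and the sample size are assumptions for illustration.

```python
import random
from bisect import bisect_right
from collections import Counter


def category_shares(thresholds, size, rng):
    """Share of N(0, 1) draws falling into each category 1..k."""
    counts = Counter(
        bisect_right(thresholds, rng.gauss(0.0, 1.0)) + 1 for _ in range(size)
    )
    k = len(thresholds) + 1
    return [counts.get(c, 0) / size for c in range(1, k + 1)]


rng = random.Random(0)
well = category_shares([-0.5, 0.0, 0.5], 10_000, rng)
ill = category_shares([0.81, 0.82, 0.83], 10_000, rng)

if __name__ == "__main__":
    print("well-organized:", well)
    print("ill-organized: ", ill)
```

In the ill-organized case almost all observations pile into the first category and the two middle categories are nearly empty, which is the asymmetry and overconcentration described above.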
It is very difficult to know the true locations of thresholds in real research situations. At the same time, it seems very reasonable to postulate that the locations of thresholds are different from person to person or from situation to situation. Therefore, an implication of the simulation results is that Pearson’s correlation coefficient between categorized variables will decrease more if we consider a variety of data acquisition procedures and variety of threshold locations.
6 Conclusion
This study ran a large-scale simulation regarding biases of r when using categorized variables, carefully manipulating threshold locations. The results have shown that more serious biases of r occurred when thresholds are ill-organized. The findings suggest that previous simulation studies may have underestimated biases of r, and users of Likert-scales in social science should take the biases caused by categorized variables more seriously. Otherwise, biased values of r would result in incorrect interpretations of obtained data.
One of the possible ways to cope with the biases is the use of polychoric correlation. Estimation procedures of polychoric correlation were proposed by Olsson (1979) using maximum likelihood procedures and by Shiina et al. (2018) using the EM algorithm, although the use of polychoric correlation is not common in psychology.
There may be some limitations in this study. First, we have paid attention only to Pearson’s r, not to other kinds of correlations (such as polychoric correlation or Spearman’s rank correlation). Therefore, further studies are needed to examine the extent of the biases of different types of correlations. Second, our simulation has not completely examined the possible arrangement of thresholds. Although we set upper and lower limits [− 1, 1] for generating a set of thresholds, other upper and lower limits should also be considered. Such considerations will provide insights into the nature of the bias.
Notes
Funding
This work was supported by JSPS KAKENHI Grant Numbers: 16H02050 and 18K03048.
Compliance with ethical standards
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
References
- Bollen KA, Barb KH (1981) Pearson’s r and coarsely categorized measures. Am Sociol Rev 46:232–239
- David HA, Nagaraja HN (2003) Order statistics, 3rd edn. Wiley, New Jersey
- Martin WS (1973) The effects of scaling on the correlation coefficient: a test of validity. J Mark Res 10(3):316–318
- Martin WS (1978) Effects of scaling on the correlation coefficient: additional considerations. J Mark Res 15(2):304–308
- Morrison DG (1972) Regressions with discrete dependent variables: the effect on R². J Mark Res 9(3):338–340
- Olsson U (1979) Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika 44(4):443–460
- Pearson K (1913) On the measurement of the influence of “Broad Categories” on correlation. Biometrika 9:116–139
- Pearson K, Pearson ES (1922) On polychoric coefficients of correlation. Biometrika 14:127–156
- Ritchie-Scott A (1918) The correlation coefficient of a polychoric table. Biometrika 12:93–133
- Shiina K, Ouchi Y, Kubo S, Ueda T (2012) A dreadful secret of Pearson’s r. The 76th annual convention of Japanese Psychological Association. https://psych.or.jp/meeting/proceedings/76/contents/pdf/1EVA14.pdf. Accessed 13 Feb 2019
- Shiina K, Ueda T, Kubo S (2018) Polychoric correlations for ordered categories using the EM algorithm. In: Wiberg M et al (eds) Quantitative psychology. Springer Proceedings in Mathematics & Statistics, pp 247–259
- Torgerson WS (1958) Theory and methods of scaling. Wiley, New York
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.