Abstract
When applying supervised machine learning algorithms to classification, the classical goal is to reconstruct the true labels as accurately as possible. However, if the predictions of an accurate algorithm are aggregated, for example by counting the predictions of a single class label, the result is often still statistically biased. This impedes the implementation of machine learning algorithms in the context of official statistics. The statistical bias that occurs when aggregating the predictions of a machine learning algorithm is referred to as misclassification bias. In this paper, we focus on reducing the misclassification bias of binary classification algorithms by employing five existing estimation techniques, or estimators. As reducing bias might increase variance, the estimators are evaluated by their mean squared error (MSE). For three of the estimators, we are the first to derive an expression for the MSE in finite samples, complementing the existing asymptotic results in the literature. These expressions are then used to compute decision boundaries numerically, indicating under which conditions each of the estimators is optimal, i.e., has the lowest MSE. Our main conclusion is that the calibration estimator performs best in most applications. Moreover, the calibration estimator is unbiased and substantially reduces the MSE compared to that of the uncorrected aggregated predictions, supporting the use of machine learning in the context of official statistics.
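A minimal numerical sketch of the phenomenon described above (the error rates and class fraction are illustrative values, not taken from the paper): even a classifier that is 95% accurate in both classes can produce a visibly biased aggregate.

```python
# Expected value of the "classify-and-count" aggregate for a binary
# classifier with sensitivity p11 and specificity p00, applied to a
# population in which a fraction alpha belongs to class 1.
def expected_classify_and_count(alpha: float, p00: float, p11: float) -> float:
    # A true 1 is counted with probability p11; a true 0 is
    # miscounted as a 1 with probability 1 - p00.
    return alpha * p11 + (1 - alpha) * (1 - p00)

alpha = 0.10            # true fraction of the class of interest
p00, p11 = 0.95, 0.95   # a seemingly accurate classifier

estimate = expected_classify_and_count(alpha, p00, p11)
print(f"expected aggregate: {estimate:.3f}, bias: {estimate - alpha:+.3f}")
# → expected aggregate: 0.140, bias: +0.040
```

Although 95% of individual predictions are correct, the aggregated count overestimates the true fraction by 40% in this setting, because the rare class gains more false positives than it loses false negatives.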
The views expressed in this paper are those of the authors and do not necessarily reflect the policy of Statistics Netherlands. The authors would like to thank Arnout van Delden and three anonymous referees for their useful comments on previous versions of this paper.
Notes
1. The results in this section have been obtained using the statistical software R. All visualizations have been implemented in a Shiny dashboard, which in addition includes interactive 3D plots of the RMSE surface for each of the estimators. The code, together with Appendix A, can be retrieved from https://github.com/kevinkloos/Misclassification-Bias.
References
Buonaccorsi, J.P.: Measurement Error: Models, Methods, and Applications. Chapman & Hall/CRC, Boca Raton (2010)
Burger, J., Delden, A.v., Scholtus, S.: Sensitivity of mixed-source statistics to classification errors. J. Off. Stat. 31(3), 489–506 (2015). https://doi.org/10.1515/jos-2015-0029
Curier, R., et al.: Monitoring spatial sustainable development: semi-automated analysis of satellite and aerial images for energy transition and sustainability indicators. arXiv preprint arXiv:1810.04881 (2018)
Czaplewski, R.L.: Misclassification bias in areal estimates. Photogram. Eng. Remote Sens. 58(2), 189–192 (1992)
Czaplewski, R.L., Catts, G.P.: Calibration of remotely sensed proportion or area estimates for misclassification error. Remote Sens. Environ. 39(1), 29–43 (1992). https://doi.org/10.1016/0034-4257(92)90138-A
González, P., Castaño, A., Chawla, N.V., Coz, J.J.D.: A review on quantification learning. ACM Comput. Surv. 50(5), 74:1–74:40 (2017). https://doi.org/10.1145/3117807
Grassia, A., Sundberg, R.: Statistical precision in the calibration and use of sorting machines and other classifiers. Technometrics 24(2), 117–121 (1982). https://doi.org/10.1080/00401706.1982.10487732
Greenland, S.: Sensitivity analysis and bias analysis. In: Ahrens, W., Pigeot, I. (eds.) Handbook of Epidemiology, pp. 685–706. Springer, New York (2014). https://doi.org/10.1007/978-0-387-09834-0_60
Hopkins, D.J., King, G.: A method of automated nonparametric content analysis for social science. Am. J. Polit. Sci. 54(1), 229–247 (2010). https://doi.org/10.1111/j.1540-5907.2009.00428.x
Knottnerus, P.: Sample Survey Theory: Some Pythagorean Perspectives. Springer, New York (2003). https://doi.org/10.1007/978-0-387-21764-2
Kuha, J., Skinner, C.J.: Categorical data analysis and misclassification. In: Lyberg, L., et al. (eds.) Survey Measurement and Process Quality, pp. 633–670. Wiley, New York (1997)
Löw, F., Knöfel, P., Conrad, C.: Analysis of uncertainty in multi-temporal object-based classification. ISPRS J. Photogramm. Remote Sens. 105, 91–106 (2015). https://doi.org/10.1016/j.isprsjprs.2015.03.004
Meertens, Q.A., Diks, C.G.H., Herik, H.J.v.d., Takes, F.W.: A data-driven supply-side approach for estimating cross-border internet purchases within the European Union. J. R. Stat. Soc. Ser. A (Stat. Soc.) 183(1), 61–90 (2020). https://doi.org/10.1111/rssa.12487
Meertens, Q.A., Diks, C.G.H., Herik, H.J.v.d., Takes, F.W.: A Bayesian approach for accurate classification-based aggregates. In: Berger-Wolf, T.Y., et al. (eds.), Proceedings of the 19th SIAM International Conference on Data Mining, pp. 306–314 (2019). https://doi.org/10.1137/1.9781611975673.35
Moreno-Torres, J.G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N.V., Herrera, F.: A unifying view on dataset shift in classification. Pattern Recogn. 45(1), 521–530 (2012). https://doi.org/10.1016/j.patcog.2011.06.019
O’Connor, B., Balasubramanyan, R., Routledge, B., Smith, N.: From Tweets to polls: linking text sentiment to public opinion time series. In: Proceedings of the International AAAI Conference on Weblogs and Social Media, Washington, DC (2010)
Scholtus, S., Delden, A.v.: On the accuracy of estimators based on a binary classifier, Discussion Paper No. 202007, Statistics Netherlands, The Hague (2020)
Schwartz, J.E.: The neglected problem of measurement error in categorical data. Soc. Methods Res. 13(4), 435–466 (1985). https://doi.org/10.1177/0049124185013004001
Strichartz, R.S.: The Way of Analysis. Jones & Bartlett Learning, Sudbury (2000)
Delden, A.v., Scholtus, S., Burger, J.: Accuracy of mixed-source statistics as affected by classification errors. J. Off. Stat. 32(3), 619–642 (2016). https://doi.org/10.1515/jos-2016-0032
Wiedemann, G.: Proportional classification revisited: automatic content analysis of political manifestos using active learning. Soc. Sci. Comput. Rev. 37(2), 135–159 (2019). https://doi.org/10.1177/0894439318758389
Appendix
This appendix contains the proofs of the theorems presented in the paper entitled “Comparing Correction Methods to Reduce Misclassification Bias”. Recall that we have assumed a population of size N in which a fraction \(\alpha := N_{1+} / N\) belongs to the class of interest, referred to as the class labelled as 1. We assume that a binary classification algorithm has been trained that correctly classifies a data point that belongs to class \(i \in \{0,1\}\) with probability \(p_{ii} > 0.5\), independently across all data points. In addition, we assume that a test set of size \(n \ll N\) is available and that it can be considered a simple random sample from the population. The classification probabilities \(p_{00}\) and \(p_{11}\) are estimated on that test set as described in Sect. 2. Finally, we assume that the classify-and-count estimator \(\hat{\alpha }^{*}\) is distributed independently of \(\hat{p}_{00}\) and \(\hat{p}_{11}\), which is reasonable (at least as an approximation) when \(n \ll N\).
It may be noted that the estimated probabilities \(\hat{p}_{11}\) and \(\hat{p}_{00}\) defined in Sect. 2 cannot be computed if \(n_{1+} = 0\) or \(n_{0+} = 0\). Similarly, the calibration probabilities \(c_{11}\) and \(c_{00}\) cannot be estimated if \(n_{+1} = 0\) or \(n_{+0} = 0\). We assume here that these events occur with negligible probability. This will be true when n is sufficiently large so that \(n \alpha \gg 1\) and \(n (1 - \alpha ) \gg 1\).
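The quantities used throughout the proofs can be read off a test-set confusion matrix. The following sketch (with a hypothetical matrix) computes the row proportions \(\hat{p}_{00}, \hat{p}_{11}\) and the column proportions \(\hat{c}_{00}, \hat{c}_{11}\) in the manner described in Sect. 2:

```python
def error_and_calibration_probs(n00, n01, n10, n11):
    """Row (p) and column (c) proportions of a 2x2 test-set confusion
    matrix, where n_ij counts test points of true class i labelled j."""
    p00_hat = n00 / (n00 + n01)   # P(predicted 0 | true 0)
    p11_hat = n11 / (n10 + n11)   # P(predicted 1 | true 1)
    c00_hat = n00 / (n00 + n10)   # P(true 0 | predicted 0)
    c11_hat = n11 / (n01 + n11)   # P(true 1 | predicted 1)
    return p00_hat, p11_hat, c00_hat, c11_hat

# Hypothetical test set of n = 150 points:
#   true 0: 90 labelled 0, 10 labelled 1;  true 1: 5 labelled 0, 45 labelled 1.
print(error_and_calibration_probs(90, 10, 5, 45))
```

Note that the row proportions are undefined when a true class is absent from the test set, and the column proportions when a predicted class is absent, which is exactly the negligible-probability event discussed above.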
Preliminaries
Many of the proofs presented in this appendix rely on the following two mathematical results. First, we will use univariate and bivariate Taylor series to approximate the expectation of non-linear functions of random variables. That is, to estimate E[f(X)] and E[g(X, Y)] for sufficiently differentiable functions f and g, we will insert the Taylor series for f and g at \(x_0 = E[X]\) and \(y_0 = E[Y]\) up to terms of order 2 and utilize the linearity of the expectation; in the univariate case this yields \(E[f(X)] \approx f(x_0) + \frac{1}{2} f''(x_0) V(X)\). Second, we will use the following conditional variance decomposition for the variance of a random variable X:

\(V(X) = E\left[ V(X \mid Y) \right] + V\left( E[X \mid Y] \right). \qquad (19)\)
The conditional variance decomposition follows from the tower property of conditional expectations [10]. Before we prove the theorems presented in the paper, we begin by proving the following lemma.
Lemma 1
The variance of the estimator \(\hat{p}_{11}\) for \(p_{11}\) estimated on the test set is given by

\(V(\hat{p}_{11}) \approx \frac{p_{11}(1-p_{11})}{n\alpha }\left( 1 + \frac{1-\alpha }{n\alpha }\right) .\)
Similarly, the variance of \(\hat{p}_{00}\) is given by

\(V(\hat{p}_{00}) \approx \frac{p_{00}(1-p_{00})}{n(1-\alpha )}\left( 1 + \frac{\alpha }{n(1-\alpha )}\right) .\)
Moreover, \(\hat{p}_{11}\) and \(\hat{p}_{00}\) are uncorrelated: \(C(\hat{p}_{11},\hat{p}_{00}) = 0\).
Proof
(of Lemma 1). We approximate the variance of \(\hat{p}_{00}\) using the conditional variance decomposition and a second-order Taylor series, as follows:
The variance of \(\hat{p}_{11}\) is approximated in the exact same way.
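Explicitly, the decomposition for \(\hat{p}_{00}\) reads as follows (a sketch of the omitted steps, writing \(\hat{p}_{00} = n_{00}/n_{0+}\) with \(n_{0+} \sim Bin(n, 1-\alpha )\) and, conditionally on \(n_{0+}\), \(n_{00} \sim Bin(n_{0+}, p_{00})\)):

```latex
\begin{align*}
V(\hat{p}_{00})
  &= E\left[ V(\hat{p}_{00} \mid n_{0+}) \right]
   + V\left( E[\hat{p}_{00} \mid n_{0+}] \right)
   = E\left[ \frac{p_{00}(1 - p_{00})}{n_{0+}} \right] + V(p_{00}) \\
  &= p_{00}(1 - p_{00})\, E\left[ \frac{1}{n_{0+}} \right]
   \approx \frac{p_{00}(1 - p_{00})}{n(1 - \alpha)}
     \left( 1 + \frac{\alpha}{n(1 - \alpha)} \right),
\end{align*}
```

where the final step applies the second-order Taylor approximation \(E[1/n_{0+}] \approx 1/E[n_{0+}] + V(n_{0+})/E[n_{0+}]^3\) with \(E[n_{0+}] = n(1-\alpha )\) and \(V(n_{0+}) = n\alpha (1-\alpha )\).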
Finally, to evaluate \(C(\hat{p}_{11},\hat{p}_{00})\) we use the analogue of Eq. (19) for covariances: \(C(X, Z) = E\left[ C(X, Z \mid Y)\right] + C\left( E[X \mid Y], E[Z \mid Y]\right) \).
The second term is zero as before. The first term also vanishes because, conditional on the row totals \(n_{1+}\) and \(n_{0+}\), the counts \(n_{11}\) and \(n_{00}\) follow independent binomial distributions, so \(C(n_{11},n_{00}\mid n_{1+},n_{0+}) = 0\).
Note: in the remainder of this appendix, we will not add explicit subscripts to expectations and variances when their meaning is unambiguous.
Subtracted-Bias Estimator
We will now prove the bias and variance approximations for the subtracted-bias estimator \(\hat{\alpha }_b\) that was defined in Eq. (9).
Proof
(of Theorem 1). The bias of \(\hat{\alpha }_b\) is given by
Because \(\hat{\alpha }^*\) and \((\hat{p}_{00} + \hat{p}_{11} - 2)\) are assumed to be independent, the expectation of their product equals the product of their expectations:
This proves the formula for the bias of \(\hat{\alpha }_b\) as estimator for \(\alpha \). To approximate the variance of \(\hat{\alpha }_b\), we apply the conditional variance decomposition of Eq. (19) conditional on \(\hat{\alpha }^*\) and look at the two resulting terms separately. First, consider the expectation of the conditional variance:
In the penultimate line, we used that \(C (\hat{p}_{11},\hat{p}_{00}) = 0\). The second moment \(E \left[ (\hat{\alpha }^*)^2 \right] \) can be written as \(E \left[ \hat{\alpha }^*\right] ^2 + V(\hat{\alpha }^*)\). Because \(V(\hat{\alpha }^*)\) is of order 1/N, it can be neglected compared to \(E \left[ \hat{\alpha }^* \right] ^2\), which is of order 1. In particular, we find that the expectation of the conditional variance equals:
Next, the variance of the conditional expectation can be seen to equal the following:
Because \(V(\hat{\alpha }^*)\) is of order 1/N, it can be neglected in the final formula. Furthermore, the variances of \(\hat{p}_{00}\) and \(\hat{p}_{11}\) can be written out using the result from Lemma 1:
This concludes the proof of Theorem 1.
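In code, the subtracted-bias estimator takes the following form (a sketch: consistent with the expansion used above, Eq. (9) is taken to be \(\hat{\alpha }_b = \hat{\alpha }^* - \hat{B}\) with the plug-in bias estimate \(\hat{B} = (1-\hat{\alpha }^*)(1-\hat{p}_{00}) - \hat{\alpha }^*(1-\hat{p}_{11})\); the numbers are illustrative):

```python
def subtracted_bias_estimator(alpha_star, p00_hat, p11_hat):
    # Plug-in estimate of the classify-and-count bias
    # B = (1 - alpha)(1 - p00) - alpha(1 - p11), with alpha* substituted
    # for the unknown alpha, subtracted from the classify-and-count value.
    bias_hat = (1 - alpha_star) * (1 - p00_hat) - alpha_star * (1 - p11_hat)
    return alpha_star - bias_hat

# Classify-and-count gave 0.14 with p00 = p11 = 0.95 (true alpha: 0.10).
print(subtracted_bias_estimator(0.14, 0.95, 0.95))
```

For these values the estimate moves from 0.14 to 0.104, closer to the true 0.10 but, as Theorem 1 shows, not all the way: the bias correction itself is estimated with a biased plug-in value \(\hat{\alpha }^*\).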
Misclassification Estimator
We will now prove the bias and variance approximations for the misclassification estimator \(\hat{\alpha }_p\) as defined in Eq. (12).
Proof
(of Theorem 2). Under the assumption that \(\hat{\alpha }^{*}\) is distributed independently of \((\hat{p}_{00}, \hat{p}_{11})\), it holds that
\(E(\hat{\alpha }^{*})\) is known from Eq. (4). To evaluate the other two expectations, we use a second-order Taylor series approximation. The first- and second-order partial derivatives of \(f(x, y) = 1/(x + y - 1)\) and \(g(x, y) = (x - 1)/(x + y - 1) = 1 - [y/(x + y - 1)]\) are given by:
Now also using that \(C(\hat{p}_{11},\hat{p}_{00}) = 0\), we obtain for the first expectation:
Here, we have included only the first term of the approximations to \(V(\hat{p}_{00})\) and \(V(\hat{p}_{11})\) from Lemma 1, since this suffices to approximate the bias up to terms of order O(1/n). Similarly, for the second expectation we obtain:
Using Eqs. (22), (4), (26) and (27), we conclude that:
From this, it follows that an approximation to the bias of \(\hat{\alpha }_p\) that is correct up to terms of order O(1/n) is given by:
By expanding the products in this expression and combining similar terms, the expression can be simplified to:
Finally, using the identity \(p_{11}(1-p_{11}) - p_{00}(1-p_{00}) = (p_{00}+p_{11}-1)(p_{00}-p_{11})\), we obtain the required result for \(B(\hat{\alpha }_p)\).
To approximate the variance of \(\hat{\alpha }_p\), we apply the conditional variance decomposition conditional on \(\hat{\alpha }^*\) and look at the two resulting terms separately. First, consider the variance of the conditional expectation:
where in the last line we used Eq. (6). Note: the factor \(1/(p_{00} + p_{11} - 1)^2\) can become arbitrarily large in the limit \(p_{00} + p_{11} \rightarrow 1\). It will be seen below that this same factor also occurs in the dominant terms of \(V(\hat{\alpha }_p)\) of order 1/n; hence, the relative contribution of Eq. (28) remains negligible even in the limit \(p_{00} + p_{11} \rightarrow 1\).
Next, we compute the expectation of the conditional variance.
To approximate the variance and covariance terms, we use a first-order Taylor series. Using the partial derivatives in Eqs. (23), (24) and (25), we obtain:
Substituting these terms into Eq. (29) and accounting for Eq. (28) yields:
Finally, inserting the expressions for \(V(\hat{p}_{00})\) and \(V(\hat{p}_{11})\) from Lemma 1 yields:
from which Eq. (14) follows. This concludes the proof of Theorem 2.
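In code, the misclassification estimator amounts to inverting the expectation of the classify-and-count estimator (a sketch consistent with the functions f and g used in the proof; the numbers are illustrative):

```python
def misclassification_estimator(alpha_star, p00_hat, p11_hat):
    # Invert E[alpha*] = alpha * p11 + (1 - alpha)(1 - p00) for alpha,
    # with the estimated probabilities substituted for the unknown ones.
    return (alpha_star + p00_hat - 1) / (p00_hat + p11_hat - 1)

# Classify-and-count gave 0.14 with p00 = p11 = 0.95 (true alpha: 0.10).
print(misclassification_estimator(0.14, 0.95, 0.95))
```

With the error probabilities known exactly, this recovers the true fraction 0.10; the denominator \(\hat{p}_{00} + \hat{p}_{11} - 1\) is the same factor whose instability near \(p_{00} + p_{11} = 1\) was noted in the proof.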
Calibration Estimator
We will now prove the bias and variance approximations for the calibration estimator \(\hat{\alpha }_c\) that was defined in Eq. (15).
Proof
(of Theorem 3). To compute the expected value of \(\hat{\alpha }_c\), we first compute its expectation conditional on the 4-vector \(\boldsymbol{N} = (N_{00},N_{01},N_{10},N_{11})\):
By the tower property of conditional expectations, it follows that \(E[\hat{\alpha }_c] = E \left[ E(\hat{\alpha }_c \mid \boldsymbol{N}) \right] = \alpha \). This proves that \(\hat{\alpha }_c\) is an unbiased estimator for \(\alpha \).
To compute the variance of \(\hat{\alpha }_c\), we use the conditional variance decomposition, again conditioning on the 4-vector \(\boldsymbol{N}\). We remark that \(N_{0+}\) and \(N_{1+}\) are deterministic values, but that \(N_{+0}\) and \(N_{+1}\) are random variables. As shown above in Eq. (30), the conditional expectation is deterministic, hence it has no variance: \(V(E[\hat{\alpha }_c \mid \boldsymbol{N}]) = 0\). The conditional variance decomposition then simplifies to the following:
The conditional variance \(V(\hat{\alpha }_c \mid \boldsymbol{N})\) can be written as follows:
We will consider these terms separately. First, the variance of \(n_{11}/n_{+1}\) can be computed by applying an additional conditional variance decomposition:
The first term is zero, which can be shown as follows:
For the second term, we find under the assumption that \(n \ll N\):
The expectation of \(\frac{1}{n_{+1}}\) can be approximated with a second-order Taylor series:
The variance of \(n_{10}/n_{+0}\) can be approximated in the same way, which yields the following expression:
Finally, it can be shown that the covariance in the final term is equal to zero:
Combining Eqs. (33), (34) and (35) with Eq. (32) gives:
Recall from Eq. (31) that \(V\left[ \hat{\alpha }_c \right] = E\left[ V[\hat{\alpha }_c \mid \boldsymbol{N}]\right] = E\left[ E\left[ V[\hat{\alpha }_c \mid \boldsymbol{N}] \mid N_{+1} \right] \right] \). Hence,
To evaluate the expectations in this expression, we observe that, conditional on the column total \(N_{+1}\), \(N_{11}\) is distributed as \(Bin(N_{+1}, c_{11})\), where \(c_{11}\) is a calibration probability as defined in Sect. 2.5. Hence,
Similarly, since \(N = N_{+1} + N_{+0}\) is fixed,
Using these results, we obtain:
and similarly
Substituting Eqs. (39) and (40) into Eq. (36) and noting that \(N_{+1}^{2}/N^2 = (\hat{\alpha }^*)^2\) and \(N_{+0}^{2}/N^2 = (1 - \hat{\alpha }^*)^2\), we obtain:
Finally, substituting the expressions for \(E(\hat{\alpha }^*)\) from Eq. (4) and the expressions for \(c_{11}\) and \(c_{00}\) from Eqs. (37) and (38), the desired Eq. (17) is obtained. This concludes the proof of Theorem 3.
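In code, the calibration estimator is a one-line reweighting (a sketch consistent with the decomposition \(\hat{\alpha }_c = \hat{\alpha }^* \hat{c}_{11} + (1 - \hat{\alpha }^*)(1 - \hat{c}_{00})\) used in the proof; the numbers are illustrative):

```python
def calibration_estimator(alpha_star, c00_hat, c11_hat):
    # Among the predicted 1s, a fraction c11 is truly 1; among the
    # predicted 0s, a fraction (1 - c00) is truly 1.
    return alpha_star * c11_hat + (1 - alpha_star) * (1 - c00_hat)

# 25% predicted positives, c00 = 0.95, c11 = 0.80.
print(calibration_estimator(0.25, 0.95, 0.80))
```

The estimate (0.2375 here) is a weighted average of the calibration probabilities, which is why, conditionally on the predicted counts, its expectation equals \(\alpha \) exactly.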
Comparing Mean Squared Errors
To conclude, we present the proof of Theorem 4, which essentially shows that the mean squared error (up to and including terms of order 1/n) of the calibration estimator is lower than that of the misclassification estimator.
Proof
(of Theorem 4). Recall that the bias of \(\hat{\alpha }_p\) as an estimator for \(\alpha \) is given by
Hence, \((B \left[ \hat{\alpha }_p \right] )^2 = O(1/n^2)\) is not relevant for \(\widetilde{MSE}[\hat{\alpha }_p]\). It follows that \(\widetilde{MSE}[\hat{\alpha }_p]\) is equal to the variance of \(\hat{\alpha }_p\) up to order 1/n. From Eq. (14) we obtain:
Recall that \(\hat{\alpha }_c\) is an unbiased estimator for \(\alpha \), i.e., \(B[\hat{\alpha }_c] = 0\). Also recall the notation \(\beta = (1-\alpha )(1-p_{00}) + \alpha p_{11}\). It follows from Eq. (17) that the variance, and hence the MSE, of \(\hat{\alpha }_c\) up to terms of order 1/n can be written as:
To prove Eq. (18), first note that
The numerator of this equation can be rewritten as follows:
Note that the obtained expression is equal to the numerator of Eq. (41). Write \(T = (1-\alpha )p_{00}(1-p_{00}) + \alpha p_{11}(1-p_{11})\) for that expression. It follows that
Writing out the second factor in the last expression gives the following:
This concludes the proof of Theorem 4.
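The conclusion of Theorem 4 can also be checked by simulation. The following Monte Carlo sketch (illustrative parameter values; the sampling scheme follows the assumptions stated at the start of this appendix) compares the empirical MSEs of the misclassification and calibration estimators:

```python
import numpy as np

def simulate_mse(alpha=0.2, p00=0.9, p11=0.85, N=100_000, n=1_000,
                 reps=2_000, seed=12345):
    """Empirical MSE of the misclassification (p) and calibration (c)
    estimators under independent classification errors."""
    rng = np.random.default_rng(seed)
    N1 = int(alpha * N)  # number of true class-1 points in the population
    mse_p = mse_c = 0.0
    for _ in range(reps):
        # Classify the whole population and count the predicted 1s.
        pred1 = rng.binomial(N1, p11) + rng.binomial(N - N1, 1 - p00)
        alpha_star = pred1 / N
        # Confusion matrix on an independent test set of size n.
        n1 = rng.binomial(n, alpha)
        n11 = rng.binomial(n1, p11)
        n00 = rng.binomial(n - n1, p00)
        n10, n01 = n1 - n11, (n - n1) - n00
        p00_hat, p11_hat = n00 / (n - n1), n11 / n1
        c11_hat = n11 / (n11 + n01)   # P(true 1 | predicted 1)
        c00_hat = n00 / (n00 + n10)   # P(true 0 | predicted 0)
        est_p = (alpha_star + p00_hat - 1) / (p00_hat + p11_hat - 1)
        est_c = alpha_star * c11_hat + (1 - alpha_star) * (1 - c00_hat)
        mse_p += (est_p - alpha) ** 2 / reps
        mse_c += (est_c - alpha) ** 2 / reps
    return mse_p, mse_c

mse_p, mse_c = simulate_mse()
print(f"misclassification MSE: {mse_p:.2e}, calibration MSE: {mse_c:.2e}")
```

For these settings the empirical MSE of the calibration estimator comes out clearly below that of the misclassification estimator, in line with Theorem 4.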
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Kloos, K., Meertens, Q., Scholtus, S., Karch, J. (2021). Comparing Correction Methods to Reduce Misclassification Bias. In: Baratchi, M., Cao, L., Kosters, W.A., Lijffijt, J., van Rijn, J.N., Takes, F.W. (eds.) Artificial Intelligence and Machine Learning. BNAIC/Benelearn 2020. Communications in Computer and Information Science, vol. 1398. Springer, Cham. https://doi.org/10.1007/978-3-030-76640-5_5
Print ISBN: 978-3-030-76639-9
Online ISBN: 978-3-030-76640-5