Abstract
When applying supervised machine learning algorithms to classification, the classical goal is to reconstruct the true labels as accurately as possible. However, if the predictions of an accurate algorithm are aggregated, for example by counting the predictions of a single class label, the result is often still statistically biased. This impedes the implementation of machine learning algorithms in the context of official statistics. The statistical bias that occurs when aggregating the predictions of a machine learning algorithm is referred to as misclassification bias. In this paper, we focus on reducing the misclassification bias of binary classification algorithms by employing five existing estimation techniques, or estimators. As reducing bias might increase variance, the estimators are evaluated by their mean squared error (MSE). For three of the estimators, we are the first to derive an expression for the MSE in finite samples, complementing the existing asymptotic results in the literature. These expressions are then used to compute decision boundaries numerically, indicating under which conditions each of the estimators is optimal, i.e., has the lowest MSE. Our main conclusion is that the calibration estimator performs best in most applications. Moreover, the calibration estimator is unbiased and substantially reduces the MSE compared to that of the uncorrected aggregated predictions, supporting the use of machine learning in the context of official statistics.
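A minimal numerical sketch of the phenomenon described above (the error rates and class fraction are illustrative values, not taken from the paper): even a classifier that is 95% accurate in both classes can produce a visibly biased aggregate.

```python
# Expected value of the "classify-and-count" aggregate for a binary
# classifier with sensitivity p11 and specificity p00, applied to a
# population in which a fraction alpha belongs to class 1.
def expected_classify_and_count(alpha: float, p00: float, p11: float) -> float:
    # A true 1 is counted with probability p11; a true 0 is
    # miscounted as a 1 with probability 1 - p00.
    return alpha * p11 + (1 - alpha) * (1 - p00)

alpha = 0.10            # true fraction of the class of interest
p00, p11 = 0.95, 0.95   # a seemingly accurate classifier

estimate = expected_classify_and_count(alpha, p00, p11)
print(f"expected aggregate: {estimate:.3f}, bias: {estimate - alpha:+.3f}")
# → expected aggregate: 0.140, bias: +0.040
```

Although 95% of individual predictions are correct, the aggregated count overestimates the true fraction by 40% in this setting, because the rare class gains more false positives than it loses false negatives.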
The views expressed in this paper are those of the authors and do not necessarily reflect the policy of Statistics Netherlands. The authors would like to thank Arnout van Delden and three anonymous referees for their useful comments on previous versions of this paper.
Notes
1. The results in this section have been obtained using the statistical software R. All visualizations have been implemented in a Shiny dashboard, which in addition includes interactive 3D plots of the RMSE surface for each of the estimators. The code, together with Appendix A, can be retrieved from https://github.com/kevinkloos/Misclassification-Bias.
References
Buonaccorsi, J.P.: Measurement Error: Models, Methods, and Applications. Chapman & Hall/CRC, Boca Raton (2010)
Burger, J., Delden, A.v., Scholtus, S.: Sensitivity of mixed-source statistics to classification errors. J. Off. Stat. 31(3), 489–506 (2015). https://doi.org/10.1515/jos-2015-0029
Curier, R., et al.: Monitoring spatial sustainable development: semi-automated analysis of satellite and aerial images for energy transition and sustainability indicators. arXiv preprint arXiv:1810.04881 (2018)
Czaplewski, R.L.: Misclassification bias in areal estimates. Photogram. Eng. Remote Sens. 58(2), 189–192 (1992)
Czaplewski, R.L., Catts, G.P.: Calibration of remotely sensed proportion or area estimates for misclassification error. Remote Sens. Environ. 39(1), 29–43 (1992). https://doi.org/10.1016/0034-4257(92)90138-A
González, P., Castaño, A., Chawla, N.V., Coz, J.J.D.: A review on quantification learning. ACM Comput. Surv. 50(5), 74:1–74:40 (2017). https://doi.org/10.1145/3117807
Grassia, A., Sundberg, R.: Statistical precision in the calibration and use of sorting machines and other classifiers. Technometrics 24(2), 117–121 (1982). https://doi.org/10.1080/00401706.1982.10487732
Greenland, S.: Sensitivity analysis and bias analysis. In: Ahrens, W., Pigeot, I. (eds.) Handbook of Epidemiology, pp. 685–706. Springer, New York (2014). https://doi.org/10.1007/978-0-387-09834-0_60
Hopkins, D.J., King, G.: A method of automated nonparametric content analysis for social science. Am. J. Polit. Sci. 54(1), 229–247 (2010). https://doi.org/10.1111/j.1540-5907.2009.00428.x
Knottnerus, P.: Sample Survey Theory: Some Pythagorean Perspectives. Springer, New York (2003). https://doi.org/10.1007/978-0-387-21764-2
Kuha, J., Skinner, C.J.: Categorical data analysis and misclassification. In: Lyberg, L., et al. (eds.) Survey Measurement and Process Quality, pp. 633–670. Wiley, New York (1997)
Löw, F., Knöfel, P., Conrad, C.: Analysis of uncertainty in multi-temporal object-based classification. ISPRS J. Photogramm. Remote Sens. 105, 91–106 (2015). https://doi.org/10.1016/j.isprsjprs.2015.03.004
Meertens, Q.A., Diks, C.G.H., Herik, H.J.v.d., Takes, F.W.: A data-driven supply-side approach for estimating cross-border internet purchases within the European Union. J. R. Stat. Soc. Ser. A (Stat. Soc.) 183(1), 61–90 (2020). https://doi.org/10.1111/rssa.12487
Meertens, Q.A., Diks, C.G.H., Herik, H.J.v.d., Takes, F.W.: A Bayesian approach for accurate classification-based aggregates. In: Berger-Wolf, T.Y., et al. (eds.), Proceedings of the 19th SIAM International Conference on Data Mining, pp. 306–314 (2019). https://doi.org/10.1137/1.9781611975673.35
Moreno-Torres, J.G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N.V., Herrera, F.: A unifying view on dataset shift in classification. Pattern Recogn. 45(1), 521–530 (2012). https://doi.org/10.1016/j.patcog.2011.06.019
O’Connor, B., Balasubramanyan, R., Routledge, B., Smith, N.: From Tweets to polls: linking text sentiment to public opinion time series. In: Proceedings of the International AAAI Conference on Weblogs and Social Media, Washington, DC (2010)
Scholtus, S., Delden, A.v.: On the accuracy of estimators based on a binary classifier, Discussion Paper No. 202007, Statistics Netherlands, The Hague (2020)
Schwartz, J.E.: The neglected problem of measurement error in categorical data. Soc. Methods Res. 13(4), 435–466 (1985). https://doi.org/10.1177/0049124185013004001
Strichartz, R.S.: The Way of Analysis. Jones & Bartlett Learning, Sudbury (2000)
Delden, A.v., Scholtus, S., Burger, J.: Accuracy of mixed-source statistics as affected by classification errors. J. Off. Stat. 32(3), 619–642 (2016). https://doi.org/10.1515/jos-2016-0032
Wiedemann, G.: Proportional classification revisited: automatic content analysis of political manifestos using active learning. Soc. Sci. Comput. Rev. 37(2), 135–159 (2019). https://doi.org/10.1177/0894439318758389
Appendix
This appendix contains the proofs of the theorems presented in the paper entitled “Comparing Correction Methods to Reduce Misclassification Bias”. Recall that we have assumed a population of size N in which a fraction \(\alpha := N_{1+} / N\) belongs to the class of interest, referred to as the class labelled as 1. We assume that a binary classification algorithm has been trained that correctly classifies a data point that belongs to class \(i \in \{0,1\}\) with probability \(p_{ii} > 0.5\), independently across all data points. In addition, we assume that a test set of size \(n \ll N\) is available and that it can be considered a simple random sample from the population. The classification probabilities \(p_{00}\) and \(p_{11}\) are estimated on that test set as described in Sect. 2. Finally, we assume that the classify-and-count estimator \(\hat{\alpha }^{*}\) is distributed independently of \(\hat{p}_{00}\) and \(\hat{p}_{11}\), which is reasonable (at least as an approximation) when \(n \ll N\).
It may be noted that the estimated probabilities \(\hat{p}_{11}\) and \(\hat{p}_{00}\) defined in Sect. 2 cannot be computed if \(n_{1+} = 0\) or \(n_{0+} = 0\). Similarly, the calibration probabilities \(c_{11}\) and \(c_{00}\) cannot be estimated if \(n_{+1} = 0\) or \(n_{+0} = 0\). We assume here that these events occur with negligible probability. This will be true when n is sufficiently large so that \(n \alpha \gg 1\) and \(n (1 - \alpha ) \gg 1\).
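The quantities used throughout the proofs can be read off a test-set confusion matrix. The following sketch (with a hypothetical matrix) computes the row proportions \(\hat{p}_{00}, \hat{p}_{11}\) and the column proportions \(\hat{c}_{00}, \hat{c}_{11}\) in the manner described in Sect. 2:

```python
def error_and_calibration_probs(n00, n01, n10, n11):
    """Row (p) and column (c) proportions of a 2x2 test-set confusion
    matrix, where n_ij counts test points of true class i labelled j."""
    p00_hat = n00 / (n00 + n01)   # P(predicted 0 | true 0)
    p11_hat = n11 / (n10 + n11)   # P(predicted 1 | true 1)
    c00_hat = n00 / (n00 + n10)   # P(true 0 | predicted 0)
    c11_hat = n11 / (n01 + n11)   # P(true 1 | predicted 1)
    return p00_hat, p11_hat, c00_hat, c11_hat

# Hypothetical test set of n = 150 points:
#   true 0: 90 labelled 0, 10 labelled 1;  true 1: 5 labelled 0, 45 labelled 1.
print(error_and_calibration_probs(90, 10, 5, 45))
```

Note that the row proportions are undefined when a true class is absent from the test set, and the column proportions when a predicted class is absent, which is exactly the negligible-probability event discussed above.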
Preliminaries
Many of the proofs presented in this appendix rely on the following two mathematical results. First, we will use univariate and bivariate Taylor series to approximate the expectation of non-linear functions of random variables. That is, to estimate E[f(X)] and E[g(X, Y)] for sufficiently differentiable functions f and g, we will insert the Taylor series for f and g at \(x_0 = E[X]\) and \(y_0 = E[Y]\) up to terms of order 2 and utilize the linearity of the expectation; in the univariate case this yields \(E[f(X)] \approx f(x_0) + \frac{1}{2} f''(x_0) V(X)\). Second, we will use the following conditional variance decomposition for the variance of a random variable X:

\(V(X) = E\left[ V(X \mid Y) \right] + V\left( E[X \mid Y] \right). \qquad (19)\)
The conditional variance decomposition follows from the tower property of conditional expectations [10]. Before we prove the theorems presented in the paper, we begin by proving the following lemma.
Lemma 1
The variance of the estimator \(\hat{p}_{11}\) for \(p_{11}\) estimated on the test set is given by

\(V(\hat{p}_{11}) \approx \frac{p_{11}(1-p_{11})}{n\alpha }\left( 1 + \frac{1-\alpha }{n\alpha }\right) .\)
Similarly, the variance of \(\hat{p}_{00}\) is given by

\(V(\hat{p}_{00}) \approx \frac{p_{00}(1-p_{00})}{n(1-\alpha )}\left( 1 + \frac{\alpha }{n(1-\alpha )}\right) .\)
Moreover, \(\hat{p}_{11}\) and \(\hat{p}_{00}\) are uncorrelated: \(C(\hat{p}_{11},\hat{p}_{00}) = 0\).
Proof
(of Lemma 1). We approximate the variance of \(\hat{p}_{00}\) using the conditional variance decomposition and a second-order Taylor series, as follows:
The variance of \(\hat{p}_{11}\) is approximated in the exact same way.
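Explicitly, the decomposition for \(\hat{p}_{00}\) reads as follows (a sketch of the omitted steps, writing \(\hat{p}_{00} = n_{00}/n_{0+}\) with \(n_{0+} \sim Bin(n, 1-\alpha )\) and, conditionally on \(n_{0+}\), \(n_{00} \sim Bin(n_{0+}, p_{00})\)):

```latex
\begin{align*}
V(\hat{p}_{00})
  &= E\left[ V(\hat{p}_{00} \mid n_{0+}) \right]
   + V\left( E[\hat{p}_{00} \mid n_{0+}] \right)
   = E\left[ \frac{p_{00}(1 - p_{00})}{n_{0+}} \right] + V(p_{00}) \\
  &= p_{00}(1 - p_{00})\, E\left[ \frac{1}{n_{0+}} \right]
   \approx \frac{p_{00}(1 - p_{00})}{n(1 - \alpha)}
     \left( 1 + \frac{\alpha}{n(1 - \alpha)} \right),
\end{align*}
```

where the final step applies the second-order Taylor approximation \(E[1/n_{0+}] \approx 1/E[n_{0+}] + V(n_{0+})/E[n_{0+}]^3\) with \(E[n_{0+}] = n(1-\alpha )\) and \(V(n_{0+}) = n\alpha (1-\alpha )\).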
Finally, to evaluate \(C(\hat{p}_{11},\hat{p}_{00})\) we use the analogue of Eq. (19) for covariances: \(C(X, Z) = E\left[ C(X, Z \mid Y)\right] + C\left( E[X \mid Y], E[Z \mid Y]\right) \).
The second term is zero as before. The first term also vanishes because, conditional on the row totals \(n_{1+}\) and \(n_{0+}\), the counts \(n_{11}\) and \(n_{00}\) follow independent binomial distributions, so \(C(n_{11},n_{00}\mid n_{1+},n_{0+}) = 0\).
Note: in the remainder of this appendix, we will not add explicit subscripts to expectations and variances when their meaning is unambiguous.
Subtracted-Bias Estimator
We will now prove the bias and variance approximations for the subtracted-bias estimator \(\hat{\alpha }_b\) that was defined in Eq. (9).
Proof
(of Theorem 1). The bias of \(\hat{\alpha }_b\) is given by
Because \(\hat{\alpha }^*\) and \((\hat{p}_{00} + \hat{p}_{11} - 2)\) are assumed to be independent, the expectation of their product equals the product of their expectations:
This proves the formula for the bias of \(\hat{\alpha }_b\) as estimator for \(\alpha \). To approximate the variance of \(\hat{\alpha }_b\), we apply the conditional variance decomposition of Eq. (19) conditional on \(\hat{\alpha }^*\) and look at the two resulting terms separately. First, consider the expectation of the conditional variance:
In the penultimate line, we used that \(C (\hat{p}_{11},\hat{p}_{00}) = 0\). The second moment \(E \left[ (\hat{\alpha }^*)^2 \right] \) can be written as \(E \left[ \hat{\alpha }^*\right] ^2 + V(\hat{\alpha }^*)\). Because \(V(\hat{\alpha }^*)\) is of order 1/N, it can be neglected compared to \(E \left[ \hat{\alpha }^* \right] ^2\), which is of order 1. In particular, we find that the expectation of the conditional variance equals:
Next, the variance of the conditional expectation can be seen to equal the following:
Because \(V(\hat{\alpha }^*)\) is of order 1/N, it can be neglected in the final formula. Furthermore, the variances of \(\hat{p}_{00}\) and \(\hat{p}_{11}\) can be written out using the result from Lemma 1:
This concludes the proof of Theorem 1.
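In code, the subtracted-bias estimator takes the following form (a sketch: consistent with the expansion used above, Eq. (9) is taken to be \(\hat{\alpha }_b = \hat{\alpha }^* - \hat{B}\) with the plug-in bias estimate \(\hat{B} = (1-\hat{\alpha }^*)(1-\hat{p}_{00}) - \hat{\alpha }^*(1-\hat{p}_{11})\); the numbers are illustrative):

```python
def subtracted_bias_estimator(alpha_star, p00_hat, p11_hat):
    # Plug-in estimate of the classify-and-count bias
    # B = (1 - alpha)(1 - p00) - alpha(1 - p11), with alpha* substituted
    # for the unknown alpha, subtracted from the classify-and-count value.
    bias_hat = (1 - alpha_star) * (1 - p00_hat) - alpha_star * (1 - p11_hat)
    return alpha_star - bias_hat

# Classify-and-count gave 0.14 with p00 = p11 = 0.95 (true alpha: 0.10).
print(subtracted_bias_estimator(0.14, 0.95, 0.95))
```

For these values the estimate moves from 0.14 to 0.104, closer to the true 0.10 but, as Theorem 1 shows, not all the way: the bias correction itself is estimated with a biased plug-in value \(\hat{\alpha }^*\).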
Misclassification Estimator
We will now prove the bias and variance approximations for the misclassification estimator \(\hat{\alpha }_p\) as defined in Eq. (12).
Proof
(of Theorem 2). Under the assumption that \(\hat{\alpha }^{*}\) is distributed independently of \((\hat{p}_{00}, \hat{p}_{11})\), it holds that
\(E(\hat{\alpha }^{*})\) is known from Eq. (4). To evaluate the other two expectations, we use a second-order Taylor series approximation. The first- and second-order partial derivatives of \(f(x, y) = 1/(x + y - 1)\) and \(g(x, y) = (x - 1)/(x + y - 1) = 1 - [y/(x + y - 1)]\) are given by:
Now also using that \(C(\hat{p}_{11},\hat{p}_{00}) = 0\), we obtain for the first expectation:
Here, we have included only the first term of the approximations to \(V(\hat{p}_{00})\) and \(V(\hat{p}_{11})\) from Lemma 1, since this suffices to approximate the bias up to terms of order O(1/n). Similarly, for the second expectation we obtain:
Using Eqs. (22), (4), (26) and (27), we conclude that:
From this, it follows that an approximation to the bias of \(\hat{\alpha }_p\) that is correct up to terms of order O(1/n) is given by:
By expanding the products in this expression and combining similar terms, the expression can be simplified to:
Finally, using the identity \(p_{11}(1-p_{11}) - p_{00}(1-p_{00}) = (p_{00}+p_{11}-1)(p_{00}-p_{11})\), we obtain the required result for \(B(\hat{\alpha }_p)\).
To approximate the variance of \(\hat{\alpha }_p\), we apply the conditional variance decomposition conditional on \(\hat{\alpha }^*\) and look at the two resulting terms separately. First, consider the variance of the conditional expectation:
where in the last line we used Eq. (6). Note: the factor \(1/(p_{00} + p_{11} - 1)^2\) can become arbitrarily large in the limit \(p_{00} + p_{11} \rightarrow 1\). It will be seen below that this same factor also occurs in the dominant terms of \(V(\hat{\alpha }_p)\) of order 1/n; hence, the relative contribution of Eq. (28) remains negligible even in the limit \(p_{00} + p_{11} \rightarrow 1\).
Next, we compute the expectation of the conditional variance.
To approximate the variance and covariance terms, we use a first-order Taylor series. Using the partial derivatives in Eqs. (23), (24) and (25), we obtain:
Substituting these terms into Eq. (29) and accounting for Eq. (28) yields:
Finally, inserting the expressions for \(V(\hat{p}_{00})\) and \(V(\hat{p}_{11})\) from Lemma 1 yields:
from which Eq. (14) follows. This concludes the proof of Theorem 2.
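In code, the misclassification estimator amounts to inverting the expectation of the classify-and-count estimator (a sketch consistent with the functions f and g used in the proof; the numbers are illustrative):

```python
def misclassification_estimator(alpha_star, p00_hat, p11_hat):
    # Invert E[alpha*] = alpha * p11 + (1 - alpha)(1 - p00) for alpha,
    # with the estimated probabilities substituted for the unknown ones.
    return (alpha_star + p00_hat - 1) / (p00_hat + p11_hat - 1)

# Classify-and-count gave 0.14 with p00 = p11 = 0.95 (true alpha: 0.10).
print(misclassification_estimator(0.14, 0.95, 0.95))
```

With the error probabilities known exactly, this recovers the true fraction 0.10; the denominator \(\hat{p}_{00} + \hat{p}_{11} - 1\) is the same factor whose instability near \(p_{00} + p_{11} = 1\) was noted in the proof.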
Calibration Estimator
We will now prove the bias and variance approximations for the calibration estimator \(\hat{\alpha }_c\) that was defined in Eq. (15).
Proof
(of Theorem 3). To compute the expected value of \(\hat{\alpha }_c\), we first compute its expectation conditional on the 4-vector \(\boldsymbol{N} = (N_{00},N_{01},N_{10},N_{11})\):
By the tower property of conditional expectations, it follows that \(E[\hat{\alpha }_c] = E \left[ E(\hat{\alpha }_c \mid \boldsymbol{N}) \right] = \alpha \). This proves that \(\hat{\alpha }_c\) is an unbiased estimator for \(\alpha \).
To compute the variance of \(\hat{\alpha }_c\), we use the conditional variance decomposition, again conditioning on the 4-vector \(\boldsymbol{N}\). We remark that \(N_{0+}\) and \(N_{1+}\) are deterministic values, but that \(N_{+0}\) and \(N_{+1}\) are random variables. As shown above in Eq. (30), the conditional expectation is deterministic, hence it has no variance: \(V(E[\hat{\alpha }_c \mid \boldsymbol{N}]) = 0\). The conditional variance decomposition then simplifies to the following:
The conditional variance \(V(\hat{\alpha }_c \mid \boldsymbol{N})\) can be written as follows:
We will consider these terms separately. First, the variance of \(n_{11}/n_{+1}\) can be computed by applying an additional conditional variance decomposition:
The first term is zero, which can be shown as follows:
For the second term, we find under the assumption that \(n \ll N\):
The expectation of \(\frac{1}{n_{+1}}\) can be approximated with a second-order Taylor series:
The variance of \(n_{10}/n_{+0}\) can be approximated in the same way, which yields the following expression:
Finally, it can be shown that the covariance in the final term is equal to zero:
Combining Eqs. (33), (34) and (35) with Eq. (32) gives:
Recall from Eq. (31) that \(V\left[ \hat{\alpha }_c \right] = E\left[ V[\hat{\alpha }_c \mid \boldsymbol{N}]\right] = E\left[ E\left[ V[\hat{\alpha }_c \mid \boldsymbol{N}] \mid N_{+1} \right] \right] \). Hence,
To evaluate the expectations in this expression, we observe that, conditional on the column total \(N_{+1}\), \(N_{11}\) is distributed as \(Bin(N_{+1}, c_{11})\), where \(c_{11}\) is a calibration probability as defined in Sect. 2.5. Hence,
Similarly, since \(N = N_{+1} + N_{+0}\) is fixed,
Using these results, we obtain:
and similarly
Substituting Eqs. (39) and (40) into Eq. (36) and noting that \(N_{+1}^{2}/N^2 = (\hat{\alpha }^*)^2\) and \(N_{+0}^{2}/N^2 = (1 - \hat{\alpha }^*)^2\), we obtain:
Finally, substituting the expressions for \(E(\hat{\alpha }^*)\) from Eq. (4) and the expressions for \(c_{11}\) and \(c_{00}\) from Eqs. (37) and (38), the desired Eq. (17) is obtained. This concludes the proof of Theorem 3.
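In code, the calibration estimator is a one-line reweighting (a sketch consistent with the decomposition \(\hat{\alpha }_c = \hat{\alpha }^* \hat{c}_{11} + (1 - \hat{\alpha }^*)(1 - \hat{c}_{00})\) used in the proof; the numbers are illustrative):

```python
def calibration_estimator(alpha_star, c00_hat, c11_hat):
    # Among the predicted 1s, a fraction c11 is truly 1; among the
    # predicted 0s, a fraction (1 - c00) is truly 1.
    return alpha_star * c11_hat + (1 - alpha_star) * (1 - c00_hat)

# 25% predicted positives, c00 = 0.95, c11 = 0.80.
print(calibration_estimator(0.25, 0.95, 0.80))
```

The estimate (0.2375 here) is a weighted average of the calibration probabilities, which is why, conditionally on the predicted counts, its expectation equals \(\alpha \) exactly.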
Comparing Mean Squared Errors
To conclude, we present the proof of Theorem 4, which essentially shows that the mean squared error (up to and including terms of order 1/n) of the calibration estimator is lower than that of the misclassification estimator.
Proof
(of Theorem 4). Recall that the bias of \(\hat{\alpha }_p\) as an estimator for \(\alpha \) is given by
Hence, \((B \left[ \hat{\alpha }_p \right] )^2 = O(1/n^2)\) is not relevant for \(\widetilde{MSE}[\hat{\alpha }_p]\). It follows that \(\widetilde{MSE}[\hat{\alpha }_p]\) is equal to the variance of \(\hat{\alpha }_p\) up to order 1/n. From Eq. (14) we obtain:
Recall that \(\hat{\alpha }_c\) is an unbiased estimator for \(\alpha \), i.e., \(B[\hat{\alpha }_c] = 0\). Also recall the notation \(\beta = (1-\alpha )(1-p_{00}) + \alpha p_{11}\). It follows from Eq. (17) that the variance, and hence the MSE, of \(\hat{\alpha }_c\) up to terms of order 1/n can be written as:
To prove Eq. (18), first note that
The numerator of this equation can be rewritten as follows:
Note that the obtained expression is equal to the numerator of Eq. (41). Write \(T = (1-\alpha )p_{00}(1-p_{00}) + \alpha p_{11}(1-p_{11})\) for that expression. It follows that
Writing out the second factor in the last expression gives the following:
This concludes the proof of Theorem 4.
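The conclusion of Theorem 4 can also be checked by simulation. The following Monte Carlo sketch (illustrative parameter values; the sampling scheme follows the assumptions stated at the start of this appendix) compares the empirical MSEs of the misclassification and calibration estimators:

```python
import numpy as np

def simulate_mse(alpha=0.2, p00=0.9, p11=0.85, N=100_000, n=1_000,
                 reps=2_000, seed=12345):
    """Empirical MSE of the misclassification (p) and calibration (c)
    estimators under independent classification errors."""
    rng = np.random.default_rng(seed)
    N1 = int(alpha * N)  # number of true class-1 points in the population
    mse_p = mse_c = 0.0
    for _ in range(reps):
        # Classify the whole population and count the predicted 1s.
        pred1 = rng.binomial(N1, p11) + rng.binomial(N - N1, 1 - p00)
        alpha_star = pred1 / N
        # Confusion matrix on an independent test set of size n.
        n1 = rng.binomial(n, alpha)
        n11 = rng.binomial(n1, p11)
        n00 = rng.binomial(n - n1, p00)
        n10, n01 = n1 - n11, (n - n1) - n00
        p00_hat, p11_hat = n00 / (n - n1), n11 / n1
        c11_hat = n11 / (n11 + n01)   # P(true 1 | predicted 1)
        c00_hat = n00 / (n00 + n10)   # P(true 0 | predicted 0)
        est_p = (alpha_star + p00_hat - 1) / (p00_hat + p11_hat - 1)
        est_c = alpha_star * c11_hat + (1 - alpha_star) * (1 - c00_hat)
        mse_p += (est_p - alpha) ** 2 / reps
        mse_c += (est_c - alpha) ** 2 / reps
    return mse_p, mse_c

mse_p, mse_c = simulate_mse()
print(f"misclassification MSE: {mse_p:.2e}, calibration MSE: {mse_c:.2e}")
```

For these settings the empirical MSE of the calibration estimator comes out clearly below that of the misclassification estimator, in line with Theorem 4.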
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Kloos, K., Meertens, Q., Scholtus, S., Karch, J. (2021). Comparing Correction Methods to Reduce Misclassification Bias. In: Baratchi, M., Cao, L., Kosters, W.A., Lijffijt, J., van Rijn, J.N., Takes, F.W. (eds.) Artificial Intelligence and Machine Learning. BNAIC/Benelearn 2020. Communications in Computer and Information Science, vol. 1398. Springer, Cham. https://doi.org/10.1007/978-3-030-76640-5_5
Print ISBN: 978-3-030-76639-9
Online ISBN: 978-3-030-76640-5