A bootstrapping approach to social media quantification

This work considers the use of classifiers in a downstream aggregation task estimating class proportions, such as estimating the percentage of reviews for a movie with positive sentiment. We derive the bias and variance of the class proportion estimator when taking classification error into account to determine how to best trade off different error types when tuning a classifier for these tasks. Additionally, we propose a method for constructing confidence intervals that correctly adjusts for classification error when estimating these statistics. We conduct experiments on four document classification tasks comparing our methods to prior approaches across classifier thresholds, sample sizes, and label distributions. Prior approaches have focused on providing the most accurate point estimate while this work focuses on the creation of correct confidence intervals that appropriately account for classifier error. Compared to the prior approaches, our methods provide lower error and more accurate confidence intervals.


Introduction
Classifiers are often used in a pipeline toward a downstream task, i.e., the classifier's outputs are used as inputs in another step within a pipeline. This paper considers the downstream goal of obtaining a statistic: specifically, after classifying documents, finding the proportion of positively classified instances. This scenario arises often, for example, when using a sentiment classifier on documents to measure the overall sentiment of a corpus. This involves aggregating the individually classified messages to calculate an overall sentiment level, e.g., the percentage of messages classified as positive. This task is known as quantification in the data mining community (Forman 2008).
Quantification is challenging because errors introduced in the classification process will cause downstream statistics to be biased. Since no classifier is perfect, we seek to study and address how classification error affects this type of analysis. Moreover, a better understanding of the downstream effect of classifier error will provide a better understanding of the optimal tradeoff of error types when tuning a classifier (e.g., precision versus recall). We discuss these motivating issues, along with prior work in this area, in Sect. 2.
Our paper has three main contributions over prior work:
- We characterize the estimator (bias, variance, and error) of sample proportions when the sample contains noisy classifications.
- We propose a more accurate method of constructing confidence intervals around class proportions estimated from noisy samples.
- We conduct experiments using our method across classifier thresholds, sample sizes, and class proportions.
We present these contributions from both a theoretical and empirical perspective. To understand the results empirically, we experiment with four datasets for the task of document classification of user-generated content: aggregating the sentiment in movie reviews, and measuring rates of vaccination from social media posts. This work expands on our prior work (Daughton and Paul 2019).

Quantifying class proportions
The quantification problem was first described in seminal work by Forman, who showed that classification errors introduce systematic bias into the calculation of the number of positives (Forman 2005, 2006, 2008). He used the term "classify and count" to describe the naïve quantification approach of simply counting the number of positively classified instances and proposed several methods for adjusting the counts based on the true and false positive rates of the classifier, with some methods motivated specifically for data with imbalanced classes (Forman 2008). For a more comprehensive review of quantification see González et al. (2017).
Here, we present a review of some methods for binary quantification to better contextualize this work. Within the quantification literature, there are several methods that follow the "classify and count" dogma, but add a third step that corrects the estimate using some combination of normalization and classifier metrics. One of the most commonly cited is Forman's Adjusted Count, which relies on the true positive rate (tpr) and false positive rate (fpr) to adjust the class proportion (see equation 2). These methods typically rely on the assumption that the true proportion positive (p) is the same in the training and test data, and thus that the true value of p relies only on the classifier (González et al. 2017). Others, like Bella et al. (2010), have proposed the use of the probability average of the classifier to adjust estimates (see equation 3). Their extension of this method to the scaled probability average further adjusts this estimate by taking into account the positive predictive values of the positive and negative classes (see equation 4).
Others have used more complex methods, including the expansion of the learning phase to include quantification estimation. Milli et al. (2013) proposed the use of modified decision trees, called quantification trees, which are optimized for quantification rather than for classifying individual instances. Barranquero et al. (2013) investigated the use of k-nearest neighbor and weighted k-nearest neighbor algorithms for quantification, with the observation that such algorithms benefit from more efficient methods to estimate classifier metrics.
Further, quantification methods are closely related to other concepts in the machine learning community. For example, the task of learning from label proportions is the inverse of quantification: it applies when the class proportions are known, but individual labels are unknown (Kück and de Freitas 2005; Musicant et al. 2007; Quadrianto et al. 2009; Yu et al. 2013). Methods of learning from label proportions are used to learn to classify individual instances when only aggregate statistics are available. Some work has applied these techniques to social media tasks, including learning to classify user demographics (Ardehaly and Culotta 2017) and estimating political surveys (Benton et al. 2016).
There are a number of additional lines of work related to quantification. For example, some have extended the quantification algorithms to explore the effect of concept drift on quantification (Xue and Weiss 2009; Pérez-Gállego et al. 2017) and to count ordinal values (Da San Martino et al. 2016).

Quantification in practice
Quantification is an increasingly widespread application of user-generated content such as social media posts. For example, many have measured public sentiment and attitudes at a large scale by aggregating the results of sentiment models applied to individual messages online (O'Connor et al. 2010; Diakopoulos and Shamma 2010; Bollen et al. 2011; Wang et al. 2012; Mitra et al. 2016). Prior work has shown that the prevalence of influenza can be estimated from the number of tweets mentioning an influenza infection (Culotta 2010; Lamb et al. 2013; Sadilek et al. 2012). Others have used classifiers to study behavior in online communities (Yin et al. 2017; Mitra et al. 2016) or patterns in news coverage (West and Pfeffer 2017).
All of the studies cited above use what is known as the "classify and count" method of quantification (Forman 2008), though they did not refer to it as such; indeed, much work on aggregating user-generated content does not reference related work on quantification, even though quantification is implicitly being performed. We were able to find only a small number of studies that used adjustments when quantifying user-generated content, mostly in the domain of sentiment analysis in social media (Gao and Sebastiani 2015, 2016; Nakov et al. 2016; Sebastiani 2018).
To the best of our knowledge, no previous work has fully characterized the expected error of "classify and count" quantification. While Forman (2008) showed that it is biased, and when the bias is an overestimate versus an underestimate, he did not provide a complete formulation of the bias, nor has prior work derived the variance of the estimator. We derive both in Sect. 4. In our experiments, we also show how the theoretical error compares to the empirical error. This type of analysis can provide insight into the tradeoff of different error types.
Second, all previously proposed quantification methods have focused on producing point estimates of class proportions. We argue that for many quantification tasks it is useful to provide confidence intervals around the estimate; indeed, many of the social media studies we cited above constructed confidence intervals or similar statistics, but did not adjust for classification error. Our work tests whether the point estimate produced by traditional bootstrapping (the average of all estimates) is more accurate than that produced using the entire sample. We then present an adjusted method for constructing bootstrap-based confidence intervals to correctly account for classification error, described in Sect. 5.
In our experiments, we show that naïvely constructed confidence intervals are highly inaccurate, and our proposed algorithm is more accurate than simply constructing confidence intervals using statistics adjusted with Forman's methods. This approach can positively impact research that uses quantification. Even a low-performing classifier can be used in downstream analyses and hypothesis testing, albeit with low statistical power, as long as the uncertainty is correctly quantified.
Our proposed confidence interval adjustment is somewhat related to other methods of accounting for measurement error (Stram et al. 1999; Barbiero and Manzi 2015; Buonaccorsi et al. 2018). Most similar to our work is that of Szpiro and Paciorek (2014), who adjust for errors in inferences that are used for downstream epidemiological analysis. However, their work is focused on complex exposure models rather than classifiers, and the error model is different from the classification errors we consider in this work.

Binary classification
This paper focuses on binary classification, though the approach is straightforward to generalize to multiclass settings.
The training dataset contains N instances X_i ∈ ℝ^M paired with labels Y_i ∈ {0, 1}. A classification function f(X_i; Θ) returns a predicted label Ŷ_i ∈ {0, 1} for instances whose labels are unknown. We refer to the predicted labels as classifications, and the classification function as a classifier. This paper will generally describe classifiers as traditional machine learning models, in which the parameters Θ are learned to optimize performance on training and validation data. However, our analyses also apply to classification functions using rule-based or dictionary-based approaches, which have been used in some of the work cited above (O'Connor et al. 2010; Culotta 2010; West and Pfeffer 2017), as long as there is still labeled data on which the classifier can be evaluated.

Classification error
Various evaluation metrics are used to measure the reliability of a classifier, typically measured on held-out test data. In machine learning, the most common metrics are precision and recall. Most of our analyses in this paper use the true and false positive rates, where the true positive rate is the percent of positive instances correctly classified as positive (equivalent to recall), and the false positive rate is the percent of negative instances classified as positive. These correspond to the maximum likelihood estimates of P(Ŷ_i = 1 | Y_i = 1) and P(Ŷ_i = 1 | Y_i = 0), respectively.
Classifiers can be tuned to raise or lower the true and false positive rates. By lowering the threshold for a positive classification, more instances will be classified as positive, increasing recall while also increasing the false positive rate. This is often visualized as the receiver operating characteristic (ROC) curve, which shows the true positive rate against the false positive rate. This is similar to a precision-recall curve, which is more common in machine learning, where the false positive rate is replaced with precision.
Precision is also called the positive predictive value, which is the maximum likelihood estimate of P(Y_i = 1 | Ŷ_i = 1). This value is used to construct confidence intervals in Sect. 5.
Table 1 summarizes the measurements used in this paper.
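As a minimal illustration of these measurements (a sketch, not the code used in our experiments), the rates and predictive values can be computed directly from held-out labels and classifications. The function name `classifier_metrics` and the toy labels below are assumptions made for this example.

```python
def classifier_metrics(y_true, y_pred):
    """Return TPR, FPR, PPV, and NPV for binary labels and classifications."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 0)
    tn = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "tpr": ratio(tp, tp + fn),  # estimate of P(Yhat = 1 | Y = 1), i.e., recall
        "fpr": ratio(fp, fp + tn),  # estimate of P(Yhat = 1 | Y = 0)
        "ppv": ratio(tp, tp + fp),  # estimate of P(Y = 1 | Yhat = 1), i.e., precision
        "npv": ratio(tn, tn + fn),  # estimate of P(Y = 0 | Yhat = 0)
    }


# Toy held-out data (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
metrics = classifier_metrics(y_true, y_pred)
print(metrics)
```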

Class proportions
After applying the classifier to data, we consider the question: how many instances were classified positive? We consider the proportion of positively classified instances as an estimator of the true proportion of positive instances:
$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} \hat{Y}_i \qquad (1)$$
where n is the number of classified instances. This can be generalized to multiple classes by treating the target class as positive and all others as negative. A proportion is a special case of the mean when the values are binary, so standard results of using sample means as estimators apply to the sample proportion. However, our analyses do not apply to means in general; they assume binary Y, so we will refer specifically to proportions.

Estimator properties
This section derives the bias, variance, and mean squared error of the estimator p̂ (Eqn. 1) (Lehmann and Casella 1998). These properties depend on the true proportion p as well as the true positive rate of the classifier (denoted α in this section) and false positive rate (denoted β), where 1 ≥ α ≥ β ≥ 0.

Bias
If Ŷ_i is noise-free (Ŷ_i = Y_i), then the sample proportion is unbiased. However, in this section, we show that the estimator is biased when Ŷ_i depends on classifier error.
Lemma 1 Let p̂ be the sample estimate of p, the proportion of positively labeled instances in a collection. Let α be the true positive rate of the classifier, and β be the false positive rate. The bias of the estimator p̂ is:
$$\mathrm{Bias}(\hat{p}) = E[\hat{p}] - p = \alpha p + \beta (1 - p) - p$$
Proof By linearity of expectation and the law of total probability,
$$E[\hat{p}] = E[\hat{Y}_i] = P(\hat{Y}_i = 1) = P(\hat{Y}_i = 1 \mid Y_i = 1) P(Y_i = 1) + P(\hat{Y}_i = 1 \mid Y_i = 0) P(Y_i = 0) = \alpha p + \beta (1 - p),$$
since P(Ŷ_i = 1 | Y_i = 1) corresponds to the true positive rate, P(Ŷ_i = 1 | Y_i = 0) corresponds to the false positive rate, and P(Y_i = 1) corresponds to the true proportion of positive instances. ◻
From this, there are two straightforward corollaries about when the estimator is unbiased in two extreme cases: the classifier makes no mistakes, and the classifier is no better than random.

Corollary 1 The estimator p̂ is unbiased when α = 1 and β = 0, where α and β are the true and false positive rates (i.e., the classifier is perfect).
Corollary 2 The bias of estimator p̂ is β − p when α = β (i.e., the classifier is no better than random). When α = β = p, the estimator is unbiased.
While not common in practice, these two scenarios intuitively demonstrate the properties of the estimator. If the classifier is perfect, then the estimator is simply the sample proportion, which is unbiased. If the classifier predictions are random, then it is unbiased if the probability of making a positive prediction is equal to the actual proportion of positive instances. More generally, we consider the relationship between α and β as a third corollary.

Corollary 3 The estimator p̂ is unbiased when the true positive rate α and false positive rate β satisfy:
$$p (1 - \alpha) = \beta (1 - p)$$
That is, the estimator is unbiased when the expected fraction of false negatives equals the expected fraction of false positives.
To show this relationship more clearly, the top row of Fig. 1 shows the bias at various values of α, β, and p. For readability, we only show a few values of β, rather than the full range of values of α and β. We show β as a function of α: one in which β is close to α, one in which it is much smaller than α, and one in between.
For the bias to be zero, α should increase as p increases. We see in the figure that there is a diagonal band of near-zero values that moves upward along α values as p increases. However, the position of the band depends on the value of β. As β decreases, the band moves toward the upper left, favoring larger α values even at low p.
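The bias and its corollaries can be checked numerically. The sketch below writes `tpr` and `fpr` for the classifier's true and false positive rates and uses the law of total probability from the proof above; the function names are hypothetical and chosen only for this illustration.

```python
def expected_estimate(p, tpr, fpr):
    """E[p_hat] = P(Yhat = 1) by the law of total probability."""
    return tpr * p + fpr * (1.0 - p)


def bias(p, tpr, fpr):
    """Bias of the classify-and-count proportion estimator."""
    return expected_estimate(p, tpr, fpr) - p


def unbiased_fpr(p, tpr):
    """False positive rate at which the bias is zero, for p < 1.

    Solves p * (1 - tpr) = fpr * (1 - p) for fpr.
    """
    return p * (1.0 - tpr) / (1.0 - p)


# A perfect classifier is unbiased.
print(bias(0.3, 1.0, 0.0))
# A random classifier (tpr == fpr) has bias fpr - p.
print(bias(0.3, 0.6, 0.6))
# False negatives and false positives cancel at the balancing fpr.
print(bias(0.3, 0.8, unbiased_fpr(0.3, 0.8)))
```

Note that the balancing false positive rate depends on p, so a threshold that is unbiased for one class distribution will generally be biased for another.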

Variance
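Next, we examine the variance of the estimator p̂.
Lemma 2 Let p̂ be the sample estimate of p, with a sample size of n. Let α be the true positive rate of the classifier, and β be the false positive rate. The variance of the estimator p̂ is:
$$\mathrm{Var}(\hat{p}) = \frac{1}{n} \big(\alpha p + \beta (1 - p)\big) \big(1 - \alpha p - \beta (1 - p)\big)$$
Proof By standard results, Var(p̂) = (1/n) Var(Ŷ_i), and Var(Ŷ_i) = E[Ŷ_i²] − E[Ŷ_i]², where we have that E[Ŷ_i] = αp + β(1 − p) from Lemma 1, and E[Ŷ_i²] = E[Ŷ_i] because Ŷ_i is binary. ◻
The variance of the estimator p̂ is minimized when
$$\big(\alpha p + \beta (1 - p)\big) \big(1 - \alpha p - \beta (1 - p)\big) = 0.$$
Given 1 ≥ α ≥ β ≥ 0, this condition is satisfied when any of the following relationships are true:
- α = β = 0
- β = 0 and p = 0
- α = β = 1
- α = 1 and p = 1
The bottom row of Fig. 1 shows the variance when n = 1. The variance tends to shrink as p and α shrink, with a stronger pattern when β is smaller. For most values of p, variance tends to be lower when α is lower (i.e., lower recall, but fewer false positives).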

Error
Finally, the mean squared error (MSE) of the estimator is given by:
$$\mathrm{MSE}(\hat{p}) = \mathrm{Bias}(\hat{p})^2 + \mathrm{Var}(\hat{p})$$
We experimentally compare this expected error to the actual error on real datasets.In practice, estimating the theoretical error requires knowledge of p, which is unknown.We suggest that for the purpose of estimating the error, p can be set to its value in historical data.
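To make the decomposition concrete, the following sketch computes the theoretical variance and MSE and compares the latter with a small simulation in which labels and classifications are drawn with known rates. It is an illustration under assumed values of p, `tpr`, and `fpr`, not a reproduction of our experiments; all function names are hypothetical.

```python
import random


def variance(p, tpr, fpr, n):
    """Var(p_hat) = q * (1 - q) / n, where q = P(Yhat = 1)."""
    q = tpr * p + fpr * (1.0 - p)
    return q * (1.0 - q) / n


def mse(p, tpr, fpr, n):
    """MSE(p_hat) = Bias(p_hat)^2 + Var(p_hat)."""
    b = tpr * p + fpr * (1.0 - p) - p
    return b * b + variance(p, tpr, fpr, n)


def simulate_mse(p, tpr, fpr, n, trials=5000, seed=0):
    """Empirical squared error of classify-and-count over simulated samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        positives = 0
        for _ in range(n):
            is_positive = rng.random() < p           # draw the true label
            rate = tpr if is_positive else fpr       # chance of a positive classification
            positives += rng.random() < rate         # draw the noisy classification
        total += (positives / n - p) ** 2
    return total / trials


theoretical = mse(0.3, 0.8, 0.1, 50)
empirical = simulate_mse(0.3, 0.8, 0.1, 50)
print(theoretical, empirical)
```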

Confidence intervals
It is important to be able to quantify the certainty of an estimate, for example with a confidence interval of the estimate.
Traditionally, confidence intervals are a function of sample size and variability in the data.However, when estimating statistics from classifiers, an additional layer of uncertainty is introduced, as not all instances will be labeled correctly.
Here, we present a nonparametric approach to constructing a confidence interval of p based on bootstrapping.

Bootstrapping: Review
Bootstrapping, or bootstrap resampling, is a procedure to simulate the statistics one would obtain when sampling from a distribution (Efron 1979; Efron and Tibshirani 1993). A bootstrapped estimate (for example, an estimate p̂) is obtained by sampling N instances with replacement from the original dataset of size N, then calculating the statistic (e.g., p̂) on the set of sampled instances. This procedure can be repeated many times to obtain a set of bootstrapped estimates. To construct a c% confidence interval, the bootstrapped estimates can be sorted, and the range of the middle c% of values can be taken as the interval.
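The standard percentile bootstrap described above can be sketched as follows. The function names, the default of 1000 resamples, and the toy labels are assumptions for illustration; the index rounding used to pick the interval endpoints is one simple convention among several.

```python
import random


def bootstrap_proportions(labels, num_samples=1000, seed=0):
    """Resample the data with replacement and recompute the proportion each time."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    for _ in range(num_samples):
        sample = [labels[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(sample) / n)
    return estimates


def percentile_interval(estimates, level=0.95):
    """Take the middle `level` fraction of the sorted bootstrapped estimates."""
    ordered = sorted(estimates)
    tail = (1.0 - level) / 2.0
    lo_idx = int(round(tail * len(ordered)))
    hi_idx = int(round((1.0 - tail) * len(ordered))) - 1
    hi_idx = max(lo_idx, min(len(ordered) - 1, hi_idx))
    return ordered[lo_idx], ordered[hi_idx]


# Toy data: 30 positive and 70 negative labels.
labels = [1] * 30 + [0] * 70
estimates = bootstrap_proportions(labels)
lower, upper = percentile_interval(estimates)
print(lower, upper)
```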

Error-adjusted bootstrapping
If bootstrapping is applied to noisy classifications rather than true labels, then the samples will not be drawn from the correct distribution.We propose an adjustment to the sampling procedure that draws from the actual distribution of the data.
For each bootstrap sample, after selecting the instances (sampled with replacement), we randomly sample the labels of the instances according to the confusion matrix of the classifier. If an instance is classified positive, we sample the label according to P(Y | Ŷ_i = 1); if an instance is classified negative, we sample the label according to P(Y | Ŷ_i = 0). In this way, rather than treating the classifications as labels directly, we sample labels based on the probability that the classifier predicted an incorrect label. This procedure simulates the classification process in addition to the sampling process when obtaining an estimate.
We refer to this approach as error-adjusted bootstrapping. The steps for obtaining a set of error-adjusted bootstrapped samples are detailed in Algorithm 1.
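The procedure described in this section can be sketched in a few lines. This is an illustrative reading of the prose description rather than a verbatim copy of Algorithm 1; the function name and the example predictive values are assumptions. Here `ppv` estimates P(Y = 1 | Ŷ = 1) and `npv` estimates P(Y = 0 | Ŷ = 0), both obtained from held-out data.

```python
import random


def error_adjusted_bootstrap(classifications, ppv, npv, num_samples=100, seed=0):
    """Bootstrap proportions in which each resampled label is redrawn from P(Y | Yhat)."""
    rng = random.Random(seed)
    n = len(classifications)
    estimates = []
    for _ in range(num_samples):
        positives = 0
        for _ in range(n):
            # Step 1: sample an instance (its classification) with replacement.
            y_hat = classifications[rng.randrange(n)]
            # Step 2: sample its label from the matching predictive value.
            p_positive = ppv if y_hat == 1 else 1.0 - npv
            positives += rng.random() < p_positive
        estimates.append(positives / n)
    return estimates


# Toy example: 40% classified positive, with assumed predictive values.
classifications = [1] * 40 + [0] * 60
estimates = error_adjusted_bootstrap(classifications, ppv=0.75, npv=0.9, num_samples=500)
point_estimate = sum(estimates) / len(estimates)
print(point_estimate)
```

A confidence interval is then taken from the sorted estimates exactly as in standard bootstrapping; only the label-drawing step differs.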

Correctness of algorithm
The underlying assumption of bootstrap resampling is that the instances are i.i.d. and that uniformly sampling an instance is a draw from P(Y). If the distribution of classifications P(Ŷ) is different from the distribution of labels P(Y), then randomly sampling from the classifier outputs will not correctly draw from P(Y).
Our approach uses the distribution P(Ŷ) and predictive values P(Y | Ŷ) to correctly calculate P(Y):
$$P(Y) = \sum_{\hat{y} \in \{0, 1\}} P(Y \mid \hat{Y} = \hat{y}) \, P(\hat{Y} = \hat{y})$$
As a generative process, sampling from this marginal distribution corresponds to the following steps for each instance i: (i) Sample ŷ_i ∼ P(Ŷ); (ii) Sample y_i ∼ P(Y | Ŷ_i = ŷ_i). This matches Algorithm 1, which thus samples a label y according to the true label distribution P(Y) rather than the classification distribution P(Ŷ).
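A small numeric check illustrates the marginalization: combining the classification distribution with the predictive values recovers the true positive rate of the labels. The confusion-matrix counts below are made up for the example, and the function name is hypothetical.

```python
def marginal_positive_rate(tp, fp, fn, tn):
    """P(Y = 1) = P(Y = 1 | Yhat = 1) P(Yhat = 1) + P(Y = 1 | Yhat = 0) P(Yhat = 0)."""
    total = tp + fp + fn + tn
    p_hat_pos = (tp + fp) / total   # P(Yhat = 1)
    p_hat_neg = (fn + tn) / total   # P(Yhat = 0)
    ppv = tp / (tp + fp)            # P(Y = 1 | Yhat = 1)
    npv = tn / (tn + fn)            # P(Y = 0 | Yhat = 0)
    return ppv * p_hat_pos + (1.0 - npv) * p_hat_neg


# Illustrative counts: 36 true positives in total, 40 instances classified positive.
tp, fp, fn, tn = 30, 10, 6, 54
recovered = marginal_positive_rate(tp, fp, fn, tn)
direct = (tp + fn) / (tp + fp + fn + tn)
classified = (tp + fp) / (tp + fp + fn + tn)
print(recovered, direct, classified)
```

In this toy case classify-and-count would report 0.40, while the marginal recovered from the predictive values matches the label proportion of 0.36.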

Predictive value estimates
As described so far, we assume the positive predictive value P(Y | Ŷ = 1) and negative predictive value P(Y | Ŷ = 0) are known. We propose two approaches to estimating these values. The first is to use cross-validation to provide point estimates of the positive and negative predictive values at each threshold of interest. This is the same approach used in prior work (Forman 2008).
The second approach extends Algorithm 1 to use a posterior distribution over predictive values. We do this by fitting a beta distribution to the individual estimates from cross-validation. We then draw a new estimate of the predictive values before sampling each label y_j during bootstrapping. We refer to this in experiments as the extended algorithm.
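The extended algorithm can be sketched as below. The text does not fix how the beta distribution is fit, so this example assumes a method-of-moments fit to the per-fold estimates; that choice, the function names, and the fold values are all assumptions made for illustration.

```python
import random
from statistics import mean, variance


def fit_beta_moments(values):
    """Method-of-moments beta fit (an assumed fitting choice) to per-fold estimates."""
    m = mean(values)
    v = variance(values)  # sample variance across folds
    if v <= 0.0:
        raise ValueError("fold estimates must vary to fit a beta distribution")
    common = m * (1.0 - m) / v - 1.0
    if common <= 0.0:
        raise ValueError("variance is too large for a beta distribution")
    return m * common, (1.0 - m) * common


def extended_bootstrap(classifications, ppv_folds, npv_folds, num_samples=100, seed=0):
    """Error-adjusted bootstrap that redraws the predictive value before each label."""
    rng = random.Random(seed)
    ppv_a, ppv_b = fit_beta_moments(ppv_folds)
    npv_a, npv_b = fit_beta_moments(npv_folds)
    n = len(classifications)
    estimates = []
    for _ in range(num_samples):
        positives = 0
        for _ in range(n):
            y_hat = classifications[rng.randrange(n)]
            if y_hat == 1:
                p_positive = rng.betavariate(ppv_a, ppv_b)
            else:
                p_positive = 1.0 - rng.betavariate(npv_a, npv_b)
            positives += rng.random() < p_positive
        estimates.append(positives / n)
    return estimates


# Hypothetical per-fold predictive values from five-fold cross-validation.
ppv_folds = [0.70, 0.75, 0.80, 0.72, 0.78]
npv_folds = [0.88, 0.90, 0.92, 0.91, 0.89]
a, b = fit_beta_moments(ppv_folds)
classifications = [1] * 40 + [0] * 60
estimates = extended_bootstrap(classifications, ppv_folds, npv_folds, num_samples=300)
point_estimate = sum(estimates) / len(estimates)
print(a, b, point_estimate)
```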

Experiments
We now experiment with estimating class proportions using document classifiers on four datasets. There are two goals of these experiments: first, to validate the above theory experimentally, and second, to show how class estimates vary in practice as the classification threshold is adjusted.

Datasets and classification details
We experiment with binary document classification on four datasets of online user-generated content:
- Flu Vaccination: A set of 10,000 tweets labeled by whether the tweet indicates that someone has received an influenza vaccination (i.e., a seasonal flu shot) (Huang et  Approximately 77% of the reviews were positive.
Classification was done using binary logistic regression classifiers implemented with scikit-learn (Pedregosa and others 2011). Grid search using fivefold cross-validation on the training data was used to tune the ℓ2 regularization parameter. For all classifiers, unigrams were used to build feature sets. While more extensive feature engineering or feature selection techniques might result in higher performing classifiers, we constructed the experiments this way to create a simple and equitable comparison between the datasets. ROC curves are shown in Figure S1. We note that classification performance is extremely high for the Yelp dataset (area under the ROC curve is nearly 1), while error rates are higher for the Twitter datasets.
We experiment with different classification thresholds, meaning we set ŷ_i = 1 if P(y_i = 1 | x_i) > τ for a threshold τ. Increasing the threshold will lower the true positive rate α while also lowering the false positive rate β, thus trading off different error types.
For the adjustment methods, we calculate the error rates and predictive values using cross-validation.In the extended version of Algorithm 1, we additionally sample the predictive values based on the cross-validation distribution.
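The effect of the threshold on the two error rates can be illustrated with a short sketch. The probabilities and labels below are made up for the example and the function name is hypothetical; in practice the rates would be computed within cross-validation folds.

```python
def rates_at_threshold(probabilities, labels, threshold):
    """Return (true positive rate, false positive rate) at a given threshold."""
    predictions = [1 if prob > threshold else 0 for prob in probabilities]
    num_positive = sum(labels)
    num_negative = len(labels) - num_positive
    tp = sum(1 for y_hat, y in zip(predictions, labels) if y_hat == 1 and y == 1)
    fp = sum(1 for y_hat, y in zip(predictions, labels) if y_hat == 1 and y == 0)
    return tp / num_positive, fp / num_negative


# Toy classifier scores P(y = 1 | x) with their true labels.
probabilities = [0.95, 0.80, 0.65, 0.40, 0.70, 0.30, 0.20, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for threshold in (0.25, 0.5, 0.75):
    tpr, fpr = rates_at_threshold(probabilities, labels, threshold)
    print(threshold, tpr, fpr)
```

In this toy example both rates fall as the threshold rises, which is the tradeoff explored in the experiments.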

Bootstrapping benefits
Before testing estimate adjustment methods, we first experiment to see if bootstrapping provides benefits over standard methods that provide a single point estimate. To do this, we compare the error when obtaining a point estimate to the error of the average of the bootstrapped estimates (see Fig. 2).
Error rates between the two methods are extremely close. In many instances, the lines are close to entirely overlapping. Bootstrapping alone may provide a benefit, but it is extremely small. The most obvious benefit is in the flu infection dataset, which is also the smallest. We also compare the error of an adjusted bootstrapped estimate to that of the point estimate (see Fig. 2). For three out of four datasets, our adjustment reduces error across all thresholds. However, in the fourth, the flu infection dataset, error is actually larger in the adjusted estimate. We hypothesize that this is because this is the smallest dataset, so the estimates of the error rates may be less accurate.

Baseline
We experimentally compare to the "adjusted counts" method from Forman (2008). Here, the true positive rate (α) and the false positive rate (β) are used to obtain an adjusted estimate of the percent of positive instances:
$$p \approx \frac{\hat{p} - \beta}{\alpha - \beta} \qquad (2)$$
where p̂ is the fraction estimated positive by the classifier. The estimate must be truncated to the range [0, 1]. While Forman (2008) introduced multiple methods for estimating p, the adjusted count method was selected for use as a baseline because it consistently performed well in general. In our experiments we calculate the adjusted counts within each bootstrapping iteration, and then construct confidence intervals of the adjusted counts.
Bella et al.'s scaled probability average (SPA) method takes advantage of the probability estimates of the classifier (Bella et al. 2010). The probability average (PA) is the average of the probabilities given by the classifier on the test dataset:
$$\mathrm{PA} = \frac{1}{n} \sum_{i=1}^{n} P(y_i = 1 \mid x_i) \qquad (3)$$
This is then scaled using the positive predictive value of the positive class and the positive predictive value of the negative class so that the final value lies in [0, 1].
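A minimal sketch of the adjusted count and the probability average follows. The function names are hypothetical, `tpr` and `fpr` stand for the cross-validated true and false positive rates, and the scaling step of the SPA method is deliberately omitted here.

```python
def adjusted_count(p_hat, tpr, fpr):
    """Forman-style adjusted count, truncated to the range [0, 1]."""
    if tpr <= fpr:
        raise ValueError("adjustment requires tpr > fpr")
    adjusted = (p_hat - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, adjusted))


def probability_average(probabilities):
    """Average of the classifier's positive-class probabilities."""
    return sum(probabilities) / len(probabilities)


# With tpr = 0.8 and fpr = 0.1, an observed proportion of 0.31 maps back to 0.30.
print(adjusted_count(0.31, 0.8, 0.1))
# Observed proportions outside [fpr, tpr] are truncated.
print(adjusted_count(0.05, 0.8, 0.1), adjusted_count(0.95, 0.8, 0.1))
print(probability_average([0.2, 0.4, 0.9]))
```

The adjustment is the algebraic inverse of the expected classify-and-count proportion, which is why an observed value of 0.31 (the expectation when p = 0.3, tpr = 0.8, fpr = 0.1) maps back to 0.30.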

Estimator error
We then calculate the mean squared error (MSE) of the classifier estimate of p on each test dataset, compared to the true proportion given by the labels. We then additionally apply bootstrapping to each group and estimate p as the mean of the proportions across the bootstrap samples. Doing so allows us to investigate whether error-adjusted bootstrapping produces better estimates of p while also addressing our original motivation of producing more accurate confidence intervals. We collect 100 bootstrap samples in all experiments.
The top row of each panel of Fig. 3 shows the observed MSE (orange) with and without making error adjustments during bootstrapping.
In general, differences in error are quite small. We also find that patterns vary across different datasets. Algorithm 1 results in a lower error than baseline methods in the flu infection dataset, but the Forman-adjusted method is the lowest in the flu vaccination and IMDB datasets, and the Bella-adjusted method is the lowest in the Yelp dataset (with Algorithm 1 coming in a close second).
We also plot the theoretical MSE calculated using the cross-validation estimates of the true positive rate α and false positive rate β, with p estimated from the training data. While the magnitude of the theoretical error does not perfectly match the observed error, the shape mimics the observed (unadjusted) error very closely.

Confidence intervals
We examine the empirical characteristics of 95% confidence intervals constructed using bootstrap sampling, with and without making various error adjustments.We look at two characteristics: the fraction of times that the true value is contained in the interval (which should be 95%, asymptotically), as well as the size of the intervals.
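Both characteristics are simple to compute once an interval has been constructed for each group. The sketch below assumes a list of (lower, upper) intervals and the corresponding true proportions; the function name and the numbers are illustrative only.

```python
def coverage_and_width(intervals, true_values):
    """Fraction of intervals containing the true value, and the mean interval width."""
    covered = sum(
        1 for (lower, upper), truth in zip(intervals, true_values)
        if lower <= truth <= upper
    )
    widths = [upper - lower for lower, upper in intervals]
    return covered / len(intervals), sum(widths) / len(widths)


# Toy intervals for four groups and their true proportions.
intervals = [(0.2, 0.4), (0.1, 0.3), (0.5, 0.7), (0.3, 0.5)]
true_values = [0.30, 0.35, 0.60, 0.45]
coverage, mean_width = coverage_and_width(intervals, true_values)
print(coverage, mean_width)
```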
The bottom row of each panel in Fig. 3 shows these characteristics. The blue lines show the fraction of correct values contained in the 95% confidence intervals. There is some variation across datasets, but in general we see that the confidence intervals constructed using error-adjusted bootstrapping correctly capture the true values around 95% of the time. There are two instances (the IMDB dataset and the flu vaccination dataset) where the extended version of Algorithm 1 more accurately captures the true values. There are also some instances where the extended Algorithm 1 captures the true value more than 95% of the time, suggesting that in some contexts this method may unnecessarily overcompensate for uncertainty in the predictive values.
[Figure caption: Colors show each dataset. Dashed lines (---) show the unweighted bootstrap method, in which the point estimate is the average of the bootstrapped estimates. Solid lines show the point estimate without using bootstrapping. Dash-dot lines (-.) show the point estimate when using Algorithm 1 to adjust the bootstrap sample.]
Importantly, we see that doing traditional bootstrapping without adjusting for classification error can severely affect the reliability of the confidence intervals.In the flu infection dataset, the unadjusted 95% confidence interval is only correct 85% of the time at best and is as low as 65% at a suboptimal threshold.This is even more striking in the flu vaccination dataset, which is noticeably noisier than other datasets.Here, a 95% confidence interval behaves like a 20% confidence interval in the worst scenario.
In general, our extended algorithm outperforms other methods, including Algorithm 1.The Forman-adjusted count method is consistently more accurate than doing no adjustment, but is generally less accurate than our extended method.Bella's method seems to be the least accurate, except in the flu vaccination dataset where our algorithm performs the worst.
The orange lines show the size of the intervals, to quantify how much wider the intervals must be to correctly adjust for error.In general, traditional bootstrapping produces some of the smallest confidence intervals.The Forman-adjusted and Bella-adjusted methods produce slightly larger confidence intervals, and the confidence intervals produced using our algorithms are the widest.

Sample size
In these experiments, we vary the number of samples per group by randomly sampling with replacement to achieve a predefined number of instances per group (n = 10, 20, 30, or 40). This experiment and the class distribution experiment presented in Sect. 6.4.2 were not performed on the IMDB dataset because the test dataset is constructed such that all reviews for a given movie are either positive or negative, though the distribution overall is 50%. This makes it incompatible with the sampling scheme that we used for these experiments.
The results of these experiments on individual datasets are shown in supplementary Figs. S2-S7. We see that the error for Algorithm 1 (e.g., Fig. S3) is among the smallest, though the error of Bella et al.'s method is often a bit smaller. However, the confidence intervals from Algorithm 1 also contain the true value at a rate much closer to the nominal level (e.g., Fig. S2). Note that the extended algorithm performs better in this capacity even as sample size changes, while the regular Algorithm 1 starts to drop at threshold extremes (though it still tends to outperform Bella and Forman in these contexts).
Data aggregated across datasets at a threshold of 0.5 are shown in Figs. 4 and 5. While error rates vary, patterns are roughly the same regardless of sample size. The Algorithm 1 method has the lowest error in almost every dataset, while the extended method and the Bella-adjusted method have the highest. This is unsurprising given that the extended method typically generates a confidence interval that is slightly larger than optimal. In addition, we see that as the number of items per group increases, all methods produce more accurate confidence intervals (Fig. 5), but our method is arguably the most consistent.

Class distribution
Next, we consider the impact of class distribution on each method. To sample to achieve the desired fraction positive, data were categorized into respective groups (e.g., each week in the Twitter data, or each business in the Yelp data). Within each group, we weighted the new sample by the desired fraction positive. For example, in the Yelp data, if the desired fraction positive was 25%, there was a 25% chance that we would pick any positive review and a 75% chance of selecting a negative review. The sampling was only performed if there were at least 10 items in the group. Again, we did not use the IMDB data for this.
Supplementary Figs. S8-S13 show the results of these experiments across all thresholds on the individual datasets. In general, we again see that Algorithm 1 outperforms all baselines regardless of the class distribution and that the Forman method performs better than the Bella et al. method. Data aggregated across datasets at a threshold of 0.5 are shown in Figs. 6 and 7. Algorithm 1 produces the smallest error, followed generally by the extended Algorithm 1 and Bella et al. We also again see that the Algorithm 1 and extended Algorithm 1 methods produce the most consistent and most accurate confidence intervals across all class distributions, and the Bella et al. and Forman methods perform more modestly in cases of greater class imbalance.

Use case: Vaccination surveillance
Lastly, we consider how this type of analysis relates to one of the motivating applications, which is using the proportion of vaccine-related tweets to measure vaccination rates in a population. To do this, we applied the classifier trained on the Twitter dataset to a larger set of approximately 1 million tweets from 2017. At different classification thresholds, we estimate the proportion of positive tweets in each month, and we compare these proportions to official flu vaccination data from the US Centers for Disease Control and Prevention (CDC), to evaluate how well monthly variations in vaccine tweets track true vaccination behavior (Huang et al. 2017). We measure this with Pearson correlation, calculating the proportions using adjusted bootstrapping from Algorithm 1 versus no adjustment.
Figure 8 shows the correlations between Twitter proportions and CDC data. With the exception of a large and unexplained drop in correlation when the threshold is 0.6, the correlation closely follows the mean squared error in Fig. 3, with an optimal correlation at a threshold of 0.5 and with low thresholds resulting in generally worse correlations than high thresholds. We note that minimizing error on tweet proportions is different from maximizing similarity to the CDC data, so it is not a priori obvious that the correlations would follow a pattern similar to that of the error. The fact that they do suggests that this type of analysis of classification error can be of additional use to downstream tasks that use class proportions in an indirect way.
While error-adjusted bootstrapping substantially reduced the error in the class proportion estimate (Fig. 3), we do not see comparably large gains in correlation in this task when comparing the adjusted estimates to the unadjusted estimates. However, error-adjusted bootstrapping seems to provide a small benefit at some thresholds.

Conclusion
We have analyzed, both theoretically and empirically, how classification error propagates to estimates of class proportions, a quantity that is often measured incorrectly in practice even though estimating it is a common application of classifiers to user-generated data. We found that confidence intervals constructed without accounting for classification error could be surprisingly inaccurate in our experiments (e.g., a 95% interval behaves like a 65% interval), highlighting the need to be careful about using classifiers in a multistage pipeline. We showed that a simple-to-implement adjustment to bootstrap sampling can correct for this, and that this adjustment can reduce mean squared error when estimating proportions. While we demonstrate the adjustment on text-based data, it is trivial to apply to any classification dataset where classifier metrics can be reasonably estimated from training data. We suggest that the type of analysis presented here can help practitioners trade off error types when tuning classifiers, and that error adjustments should be made when calculating statistics from classifier output.

Fig. 1 Bias (top) and variance (bottom) of the estimated proportion of positive instances (p̂) as a function of the true positive rate, false positive rate, and the true proportion of positive instances (p)
1 https://www.yelp.com/dataset

Fig. 2 Bootstrap importance. Colors show each dataset. Dashed lines (---) show the unweighted bootstrap method, in which the point estimate is the average of the bootstrapped estimates. Solid lines show the point estimate without using bootstrapping. Dash-dot lines (-.) show the point estimate when using Algorithm 1 to adjust the bootstrap sample

Fig. 3 Top rows: the mean squared error of estimating the proportion positive in test data at different classification thresholds, with and without making sampling adjustments, as well as the theoretical error based on the estimated true and false positive rates. Bottom rows: the size of 95% confidence intervals (orange) and fraction of true values contained within 95% confidence intervals (blue) at different classification thresholds, when constructing intervals with and without adjusting for error. With error-adjusted bootstrapping, the true value should theoretically be contained in the interval 95% of the time (shown by the dotted gray line)

Fig. 4 Mean squared error stratified by sample size across all datasets and correction methods. Data are shown at a threshold of 0.5

Fig. 5 Fraction of true values included in the confidence interval stratified by sample size across all datasets and correction methods. Data are shown at a threshold of 0.5

Fig. 6 Class distribution comparison. MSE is shown across all datasets and correction methods. Data are shown at a threshold of 0.5

Table 1 Definitions of classification metrics used in our analyses, estimated as functions of the number of true and false positives (TP and FP) and true and false negatives (TN and FN)
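As a hedged illustration of how such metrics are estimated from the four counts, the sketch below computes the standard true positive rate, false positive rate, precision, and accuracy. The table body is not reproduced here, so which metrics the table lists is an assumption, and the function name `classification_metrics` and the NaN convention for empty denominators are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics computed from confusion-matrix counts.

    Returns NaN for any metric whose denominator is zero
    (an assumed convention, chosen only for this sketch).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "tpr": ratio(tp, tp + fn),        # true positive rate (recall)
        "fpr": ratio(fp, fp + tn),        # false positive rate
        "precision": ratio(tp, tp + fp),  # fraction of predicted positives that are correct
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }
```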