The (real) need for a human touch: testing a human–machine hybrid topic classification workflow on a New York Times corpus

The classification of the items of ever-increasing textual databases has become an important goal for a number of research groups active in the field of computational social science. Due to the increased amount of text data, there is a growing number of use-cases where the initial effort of human classifiers was successfully augmented using supervised machine learning (SML). In this paper, we investigate such a hybrid workflow solution classifying the lead paragraphs of New York Times front-page articles from 1996 to 2006 according to policy topic categories (such as education or defense) of the Comparative Agendas Project (CAP). The SML classification is conducted in multiple rounds and, within each round, we run the SML algorithm on n samples and n times if the given algorithm is non-deterministic (e.g., SVM). If all the SML predictions point towards a single label for a document, then it is classified as such (this approach is also called a "voting ensemble"). In the second step, we explore several scenarios, ranging from using the SML ensemble without human validation to incorporating active learning. Using these scenarios, we can quantify the gains from the various workflow versions. We find that using human coding and validation combined with an ensemble SML hybrid approach can reduce the need for human coding while maintaining very high precision rates and offering a modest to good level of recall. The modularity of this hybrid workflow allows for various setups to address the idiosyncratic resource bottlenecks that a large-scale text classification project might face.


Introduction
With the advent of large-scale computational solutions being more and more readily available in the humanities and social sciences, quantitative political science is rapidly being transformed by the opportunities offered by machine learning solutions to both existing and new problems. Agenda setting in the times of mass use of social media (Barberá et al. 2019), predicting roll call votes in the US (Bonica 2018) or polarization of lawmakers in the UK (Peterson and Spirling 2018), identifying media frames using newspaper articles (Nicholls and Culpepper 2020), or assessing climate change discourse (Farrell 2016) are just a few of the possibilities opened up by advances in machine learning. Moreover, researchers are also exploring how to utilize neural network architectures in order to use images as data for social science research (Williams et al. 2020).
The advances made in computational social science (of which both computational political science and computational communication science are part) have often leveraged cutting-edge machine learning approaches while also making the case for why these methods should appeal to the broader social science research community (Theocharis and Jungherr 2020; Wilkerson and Casas 2017). These advances in the adoption of text mining and machine learning solutions in political science have also opened up new possibilities for major international comparative programs like the Comparative Agendas Project (CAP) and the Manifesto Research on Political Representation (MARPOR) undertaking. As Barberá et al. (2019) show, there are several possible approaches to classify news items with varying degrees of success.
As these datasets increase in size, so does the human and computational effort of labeling items. This scaling problem can be managed by using a hybrid workflow that relies on an efficient combination of human and machine coding. Some recent proposals rely on such an approach and have demonstrated promising results on Danish and Hungarian language corpora (Loftis and Mortensen 2020; Sebők and Kacsuk 2021). The common theme of this strand of the literature is the quest for the optimal workload distribution between costly human labor and the (often imprecise) supervised machine learning (SML) classification.
The goal of this article is to investigate how a hybrid, human-machine workflow generalizes to an English language corpus. We undertake this challenge by classifying articles in a novel New York Times (NYT) corpus according to the CAP coding scheme (Baumgartner et al. 2019), which distinguishes between 21 major topics. We run simulations on this corpus and compare our results to the corresponding, publicly available human-coded dataset (Boydstun 2013). In keeping with the approach of previous studies, we use Support Vector Machine classifiers (SVM) as the baseline algorithm, but we also extend the scope of the comparison by using Naïve Bayes classifiers (NB) as well. In total, we examine seven SML setups in this article: a multi-class NB, comparison of SVM and NB ensemble workflow, and finally, five different SML hybrid workflows using SVM (from no human coding to one incorporating an element of active learning).
The simulations in this paper offer valuable insights into the potential of incorporating human validation between classification rounds (as opposed to only applying validation at the beginning or the end of the classification process) in a so-called Hybrid Binary Snowball (HBS) workflow (for the details see the discussion of Fig. 1). Alongside the simulation results for various workflow setups, we also introduce indicators that allow us to quantify the gains of using such a hybrid approach where the machine learning and human coding elements are combined to potentially halve the human coding work while attaining an above 90% precision in classification. We contribute to recent research on the gains in efficiency when coders switch to validation of already coded documents from coding "virgin" texts (Loftis and Mortensen 2020).
This paper provides evidence that a hybrid workflow can be used in large-scale academic projects to address project-specific bottlenecks, save costs and allocate work more efficiently. We believe that the methods demonstrated below showcase a robust and modular approach that can be applied in many text classification contexts (not limited to policy topic classification). This modularity applies to the choice of supervised machine learning classifiers, to adjusting the ensemble to the computational constraints, and to accommodating the workflow to the difficulty of the classification task. This paper contributes to the growing literature on hybrid workflows by showing that employing a human-machine division of labor can bring tangible benefits for large-scale text classification projects. Moreover, we examine a range of possible supervised algorithms and provide benchmark results for projects using the CAP media labels for categorizing media corpora. The evidence presented in this paper clearly shows the benefits of hybrid workflows and helps social science researchers to make informed ex-ante decisions about structuring large-scale text classification projects.
In what follows, we first cover the various text classification approaches in the political science and communication fields with a special emphasis on the applications and limitations of supervised machine learning solutions. To situate our results, we also highlight the efforts of the CAP research community aimed at automated content analysis. In the methods and data section the hybrid workflow is covered in more detail as well as the reasons for choosing various SML classifiers. The next section shows our simulation results, and the Discussion details the difference between the various SML classifiers and the effectiveness of the hybrid workflow with different levels of human validation. In the Conclusion we discuss the implications of using human validation efficiently in large scale projects and we also highlight some further avenues of research.

Literature review
Computer-assisted text analysis has a wide range of tools for analysis, as already detailed in the seminal overview by Grimmer and Stewart (2013). These include using a dictionary to score each document in a corpus, supervised machine learning approaches for classifying documents, and probabilistic topic models (see also Lucas et al. (2015)). A popular approach in this line of research is to use dictionaries that assign a class to a given document based on word frequencies. As Laver and Garry (2000) demonstrate, dictionaries can be used to substitute for expert coding of party manifestos. One key advantage of the dictionary-based method is that one can rely on already existing dictionaries and apply them to the appropriate domain. The use of the Lexicoder dictionary to code the sentiment of political communications and economic news demonstrates how out-of-the-box dictionaries can be used effectively (Soroka et al. 2015; Young and Soroka 2012).
However, for text classification tasks, the literature often turns to various supervised machine learning (SML) algorithms. These include, for example, Support Vector Machines, Naïve Bayes classifiers, random forests, word embeddings, and various neural network architectures. It is generally agreed that the SML approach is well suited for classifying large bodies of text given that a training data set of high quality is available (Hillard et al. 2008; Purpura and Hillard 2006). The SVM classifier is often used as a go-to option as its performance is often superior to the Naïve Bayes, a frequent alternative (Loftis and Mortensen 2020).
A recent comparison between dictionary-based classification and SML forcefully demonstrated that the supervised approach consistently outperforms dictionaries both in terms of accuracy and in terms of precision. Barberá et al. (2019) pit SentiStrength (Thelwall et al. 2012), Lexicoder (Young and Soroka 2012), and an economic dictionary from Hopkins et al. (2017) against regularized logistic regression. According to their findings, the performance of the baseline SML approach hinges on the training sample size, although, as they show, even with as few as 250 observations, the performance of SML already matches the dictionary results (and with 2000 observations, it significantly surpasses them). These results reinforce the already mounting evidence in the quantitative social sciences that SML-based approaches outperform the dictionary-based method given sufficient quality and quantity of training data. It is important to highlight that the two classification methods can be used together to amplify each solution's strength. Dun et al. (2020) experiment with such a combination and show that one can construct a virtuous loop between the dictionary and SML approaches which results in better dictionary building and better SML results.
The wider political science and social science literature in general is also adopting newer and newer tools, including word embedding based methods, such as GloVe (Pennington et al. 2014) and Word2Vec (Mikolov et al., 2013). In a recent overview, Rodriguez et al. (2021) show that using pre-trained word embeddings in some cases can perform as well as humans in certain tasks.
Regarding the performance of the SML classification across various domains and languages, the social science body of research is considerably thinner. Burscher et al. (2015) examine SML performance on a set of news articles from three different Dutch newspapers and parliamentary questions over 16 years, and they find that the composition of the training set is crucial for classifier performance.
The Comparative Agendas Project (CAP) is one of the premier international scholarly networks in comparative politics, which applies a consistent multi-class policy topic classification scheme. The CAP research project also provides the data and codebook necessary for SML applications in multiple languages, which presents a fertile ground for experiments for teasing out the most efficient way of using human coders in the research workflow.
In an early example of addressing the transition to machine coding in CAP, Hillard et al. (2008) confirm that "for accurately, reliably, and efficiently classifying large numbers of complex individual events, supervised learning systems are currently the best option" (p. 44). Nearly a decade later, these findings have been repeatedly reconfirmed, as the comprehensive study of Barberá et al. (2019: 35) also concludes that one should "classify by machine but verify by human[s]". However, the multi-class challenge posed by using the CAP coding scheme also means that even with large training data and using various SML methods, precision results are rarely above 80% in the literature.
Despite the enormous amount of textual data needed to be coded, many of the CAP research teams still rely exclusively on human coding. As the relative cost of computational power steadily decreased, and various ML tools have become available in widely-used programming languages such as R and Python, some country projects of CAP started to branch out to automated policy agenda classification (either dictionary-based or a combination of dictionaries and SML, see Albaugh et al. 2013, 2014; Burscher et al. 2015). Nevertheless, most CAP project leaders reported that they did not incorporate ML approaches in their workflow. While still cheaper than coding an entire corpus by hand, SML classifiers require a training set that might include up to 20% of the entire corpus, and human coding of this portion can also mean sizeable expenses for smaller research teams. There have been some attempts to cut down on costs, either by coding only the leading paragraph for the training set, or implementing a workflow to optimize the human coding and machine classification (Burscher et al. 2015; Loftis and Mortensen 2020; Sebők and Kacsuk 2021). Recent research also emphasizes the gains in efficiency when coders switch to validation of already coded documents from coding "virgin" texts (Loftis and Mortensen 2020). These hybrid approaches also allow for quantifying the gains of using such a human-machine workflow where the SML classification and human coding elements are combined to potentially halve the human coding work while attaining an above 80% precision in classification.
The road paved by previous research is shown in Table 1. Notably, there is no clear 'leader of the pack' approach that dominates the literature. The most recent advances in developing hybrid workflows show that researchers are now focusing on augmenting the SML approaches to break through the ~0.8 precision ceiling. The above overview of the key literature shows that our test of the proposed hybrid workflow addresses a gap in the literature and should contribute to our further understanding of the trade-offs and benefits of various human-machine hybrid workflows.

The classification workflow
In this article, we follow the hybrid workflow detailed in Sebők and Kacsuk (2021), which is designed for imbalanced multi-class text classification problems for both single-corpus and multi-corpora projects. By default, supervised machine learning methods are hybrid approaches in the sense that the training sample used to "teach" the SML algorithm is labelled by human coders. However, with the hybrid workflow described below, we emphasize the division of labor between the various human inputs (classifying and validating) and the SML classifier. Figure 1 presents the outline of this workflow with a few modifications from the abovementioned article. We focus on the problems researchers might face when coding large bodies of text, and the workflow variants detailed in the latter half of the article are designed with this production mentality in mind. The implication of this viewpoint is that we put an emphasis on precision (the fraction of relevant instances among the retrieved instances) and recall (the fraction of the total amount of relevant instances that were actually retrieved) separately rather than on aggregated metrics, such as the commonly used F1 score (the harmonic mean of the precision and recall scores), which might cover up insufficient precision values.
Table 1 Overview and comparison of previous SML results. * CAP major topics and subtopics, respectively. ** Numbers are for the two label sizes, respectively.
The reason precision is prioritized so highly in this approach is that, as we explain below, we have to make sure that the unvalidated results from the machine classification can be reliably used in future coding and research. Furthermore, as will be discussed in the second half of the article, each project faces idiosyncratic research constraints, be it a lack of funding for human coding or a lack of sufficiently large computational infrastructure.
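The point about aggregated metrics can be illustrated with a minimal sketch. The function name and the toy labels below are invented for illustration and are not part of the original study; the example only shows how a seemingly acceptable F1 score can coexist with a precision that is too low for snowballing.

```python
def precision_recall_f1(true_labels, predicted_labels, target):
    """Precision, recall and F1 for one class (`target`) against all others."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if p == target and t == target)
    fp = sum(1 for t, p in pairs if p == target and t != target)
    fn = sum(1 for t, p in pairs if p != target and t == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (hypothetical): 3 true "defense" items, 5 items predicted as "defense".
truth = ["defense", "defense", "defense", "education", "education", "health"]
preds = ["defense", "defense", "defense", "defense", "defense", "health"]

p, r, f1 = precision_recall_f1(truth, preds, "defense")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case every true "defense" item is found (recall of 1.0), but two of the five "defense" predictions are wrong, so precision is 0.6 while F1 is 0.75. Reporting precision and recall separately keeps this weakness visible.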
The flexibility of the hybrid workflow model allows for accommodating setups that are tailored to the specific limitations of the project at hand. It is important to note that there will be a portion of the newly classifiable texts that will not lend themselves to classification by machine learning and will have to be coded by human experts. This is in line with the goal of moving as much work as possible to machine-based classification, with only the remaining items requiring human attention.
In order to obtain comparable results over language domains and SML classifiers, we mirror the workflow detailed in Sebők and Kacsuk (2021) and also apply the text pre-processing steps that are standard in the literature, such as removing stopwords, stemming, and term frequency-inverse document frequency (tf-idf) weighting (Denny and Spirling 2018).
Fig. 1 Overview of the hybrid binary snowball workflow. Adapted from Sebők and Kacsuk 2021
The SML classification is conducted in multiple rounds, and within each round, we run the SML algorithm on n samples and n times if the given algorithm is non-deterministic (e.g., SVM). If all the SML results agree on a single label for a document, then it is classified as such (this approach is also called a "voting ensemble"). We repeat this ensemble voting in each round. The advantage of classifying corpora in this way is that human validation can be incorporated between rounds (either for validation or for active learning where ambiguous cases are referred to the human "oracles"). For the purposes of the simulations below, we use three-fold cross-validation, for which we separate the data into three parts. Two are used as the equivalent of the human-coded initial training set, and the last third is the analog of newly classifiable documents.
Another important feature of this approach is the binary decomposition, which simplifies the multi-class classification problem into a series of binary ones by classifying each CAP major topic code against the rest of the codes. This is commonly called a "one versus all" approach. As an example, this means that the SML classifier first looks at the document and decides if it is "macroeconomics" (which is the first category of the CAP major topics) or "other". This logic is then iterated over the whole corpus and for all major topics. This setup allows for a simplified human validation (as opposed to full multi-class coding), where the coders only have to make a binary decision ("correct" or "incorrect") rather than choose from the whole range of available major topics. The advantages of binary classification have long been discussed in the machine learning community, and during our measurements we also found this transformation to be crucial to working on imbalanced multi-class text classification tasks (see also Allwein et al. 2000).
The third element in the workflow loop consists of an iterative expansion of the training set (termed snowballing in Sebők and Kacsuk 2021) where items are added to the training set without human validation either automatically or, when human validation of samples is involved, above a certain sample precision threshold. Relying on adding non-validated units to the training set involves an element of trade-off between costs saved and precision lost, with the former significantly outweighing the latter, as we demonstrate below.
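The unanimity rule of the voting ensemble can be sketched as follows. This is an illustrative simplification under stated assumptions, not the replication code: the helper names are hypothetical, and each ensemble member is represented only by the list of labels it predicted.

```python
def unanimous_vote(member_predictions):
    """Return the shared label if every ensemble member agrees, else None."""
    distinct = set(member_predictions)
    return distinct.pop() if len(distinct) == 1 else None


def classify_with_ensemble(predictions_by_member):
    """Apply the unanimity rule document by document.

    predictions_by_member: one list of labels per ensemble member,
    all lists aligned on the same documents.
    """
    return [unanimous_vote(votes) for votes in zip(*predictions_by_member)]


# Hypothetical output of three ensemble members for three documents.
members = [
    ["defense", "health", "energy"],
    ["defense", "health", "education"],
    ["defense", "health", "energy"],
]
labels = classify_with_ensemble(members)
print(labels)
```

Documents on which the members disagree stay unlabeled (`None`) and remain available for later rounds or for human coding, which is why the ensemble trades recall for precision.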
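The binary decomposition can be sketched as a simple relabeling step. The function name and the toy labels are assumptions made for illustration; in practice one binary classifier would then be trained per topic on the corresponding target vector.

```python
def one_vs_all_targets(labels, topics):
    """Map multi-class labels to one binary target vector per topic.

    For each topic, 1 marks documents carrying that topic and 0 marks
    every other document ("other").
    """
    return {
        topic: [1 if label == topic else 0 for label in labels]
        for topic in topics
    }


# Hypothetical hand-coded labels for four documents.
doc_labels = ["macroeconomics", "defense", "macroeconomics", "health"]
binary = one_vs_all_targets(doc_labels, sorted(set(doc_labels)))
print(binary["macroeconomics"])
```

With this layout, a validator reviewing the "macroeconomics" classifier only needs to mark each positive prediction as correct or incorrect.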
The iterative expansion of the training set is an oft-used technique to enhance the SML performance (Olsson 2009); however, it is important to note, that traditionally the selected elements are classified with their correct labels by the queried oracle (most often a human agent). In snowballing, the emphasis is on employing non-validated elements in the process of training set expansion. This does not mean that active learning cannot be incorporated into the workflow, as we will go on to demonstrate in the second half of the discussion, but it is nevertheless important to emphasize that snowballing is a very different approach of training set expansion, one that is aimed at cost-performance optimization (as opposed to identifying and dealing with items which are hard to classify in active learning).
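A single snowballing step can be sketched as below. The function, its parameters, and the toy data are hypothetical; the sketch only assumes that each round yields newly classified (document, label) pairs and, optionally, a validated sample precision per label.

```python
def snowball_round(training_set, newly_classified,
                   sample_precision=None, threshold=None):
    """Expand the training set with non-validated, machine-labeled items.

    training_set, newly_classified: lists of (doc_id, label) pairs.
    sample_precision: optional dict mapping label -> precision measured
        on a human-validated sample of that label's new items.
    threshold: minimum sample precision required for a label's items to
        be added; None means automatic snowballing of all new items.
    """
    precisions = sample_precision or {}
    expanded = list(training_set)
    for doc_id, label in newly_classified:
        if threshold is None or precisions.get(label, 0.0) >= threshold:
            expanded.append((doc_id, label))
    return expanded


train = [("d1", "defense")]
new = [("d2", "defense"), ("d3", "health")]

automatic = snowball_round(train, new)
gated = snowball_round(train, new, {"defense": 0.90, "health": 0.60},
                       threshold=0.75)
print(len(automatic), len(gated))
```

The gated variant corresponds to the case where human validation of samples is involved: only labels whose validated sample clears the threshold feed the next round.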

Data
Our aim in this study is to extend the external validity of previous studies on hybrid multi-class classification in comparative politics and communication, as well as to test the impact of different setups on research results. We undertake these tasks by using a preexisting hand-coded dataset of New York Times articles (Barberá et al. 2019; Boydstun 2013; Soroka et al. 2015) and adding the text of the lead paragraphs to these articles via the New York Times API. This data allows us to evaluate the performance of the hybrid workflow against the gold standard of human coding by using parts of this data without the human labels and then validating the machine-labeled results against the human-coded data. Moreover, this approach also makes it possible to simulate how human coders would provide validation between each SML classification round. While the studies cited above obtained the corpus using the Lexis-Nexis database, we opted to download our corpus via the freely available New York Times API (https://developer.nytimes.com/apis). The exact API calls and replication code for the corpus can be accessed at the Harvard Dataverse repository: https://doi.org/10.7910/DVN/I24CYV. For the sake of convenience, we have also provided a preprocessed and anonymized version of our corpus, in which all texts are tokenized, stemmed, changed to lower case, stripped of numbers and punctuation, and tokens are alphabetically sorted, which essentially provides a summary of the word distributions for each text in accordance with NYTimes copyright and API terms. As a result, methods working under the "bag of words" assumption will work with this preprocessed corpus; however, the actual content of the lead paragraphs cannot be discerned. The original metadata of the Comparative Agendas Project dataset is available here: https://www.comparativeagendas.net/files/nyt-frontpage-complete-data.
More information about the data collection, variables, and topic coding of the original dataset can be found in its codebook. The original dataset covers 31,034 observations spanning the years 1996-2006. For a thorough discussion on the importance and pitfalls of validation, see Song et al. (2020).
Our new corpus contains 28,548 documents from 1996 to 2006. The difference in sample size between our API-accessed data and the coded NYT data is due to the reduced availability of matched text data through the NYT API (instances where the caption, title, or date did not create a match between the two datasets). For the main descriptive statistics and major code distribution of our new dataset, see Fig. 2. The four most populous categories in the sample are "International Affairs", "Defense", "Government Operations", and "Other". The "Other" category is a catch-all for all CAP media categories above code 21.

Baseline simulations
In order to investigate the external validity of the hybrid workflow highlighted in the introduction, we apply this approach to the English language lead paragraphs of the New York Times. Our aim is to classify them into the correct one (the double-blind hand-coded label) of the 21 CAP major categories. To establish a baseline, we classified the documents in the corpus using the commonly used supervised methods (see Table 1): Naïve Bayes, Random Forest, LASSO regression, and Support Vector Machine. The aim of this is to have a baseline without the multi-round, hybrid ensemble setup that closely resembles prior research designs and results, as discussed above. The category-wise precision and recall scores for this setup are shown in Table 2. Given the focus of this paper, we highlight the disaggregated results, as opposed to composite indicators such as the F1 score, as they convey more detailed information. The focus on precision results is also a necessity because, based on our experiments, the snowballing element relies on achieving a 75% or higher precision to work, and if the SML classification is not able to reach this threshold, the snowballing will result in an undesirable drop in the performance of the SML classifier.
In line with what one could have expected based on the reviewed literature, neither the precision nor the recall exceeds 80% for these non-hybrid baseline setups. The poorest performer is the Random Forest classifier (both in terms of precision and recall), while the LASSO regression and SVM perform relatively well. Overall, the SVM produces the best results due to its second-best precision and best recall. However, Table 2 shows that the performance of each classifier varies greatly across categories, with only a handful of instances of above 80% precision or recall (out of the 84 category predictions, only 11 and 10, respectively).
Now we move to the hybrid setups, comparing the Naïve Bayes and SVM performance. These two approaches were selected based on Tables 1 and 2: the NB classifier is a widely used benchmark in the literature that performed relatively well in the baseline modeling; the SVM offers balanced performance, and it serves as a possible improvement over the NB approach. Three rounds of classification are applied with a 7 sample × 7 model SML ensemble. For each round of classification, we employ automatic snowballing for all newly classified elements, regardless of any precision threshold. Figure 3 compares the SVM-based and Naïve Bayes-based SML approaches.
Fig. 3 Scores for "machine only" NB and "machine only" SVM workflows. Dotted lines separate cases where at least one classifier reached 0.8 precision or 0.5 recall
When checking the results against the baseline multi-class NB classifier, the improvement is apparent for both the NB and SVM-based classification processes. For the Naïve Bayes classifier, the advantages of the hybrid workflow are clear: there are six categories with above 80% precision, and three of these (Energy, Education, and International Affairs) are above 85%, with precision scores of 86.1%, 85.7%, and 85.3%, respectively. The gain in precision is paid for in recall, however, as panel B in Fig. 3 demonstrates. With the hybrid workflow, only four categories are above 60% recall, and only Government Operations reaches 72%. This, however, is an expected result. The simple multi-class Naïve Bayes classifier assigns a class to every single text in the corpus, whereas the classifying ensemble in the hybrid workflow only labels texts that are unanimously agreed on by all models in the ensemble.
Furthermore, as our literature review suggested, the SVM-based workflow outperformed the one building on Naïve Bayes classifiers. Matched against the NB setup (leaving everything unchanged but the SML algorithm), the SVM-based workflow performs better in 19 categories out of the 21 based on precision and 13 times for recall. The gains in precision are quite large, and the hybrid workflow using the 7 samples × 7 models setup and SVM classifiers is able to produce precision values above 85% in 10 categories and above 80% in 16 categories. This performance is a significant improvement compared to previous SML approaches to classifying news documents into the 21 CAP major categories. In light of the results in Fig. 3, the SVM classifier-based workflow clearly performs better than the Naïve Bayes classifier-based one. Therefore, for the remainder of this article, we focus on measurements relying on SVM. In order to determine the optimal number of samples and training rounds to be used in our further simulations, we run a comparison of the classifying ensembles from 1 sample × 1 model all the way to 10 samples × 10 models. The precision and recall trade-offs are shown in Fig. 4.
Based on these simulations, we settled on a 3 × 3 setup for our further work as it was the first to achieve precision over 85%. This choice is, of course, specific to our desired results, as the recall difference between different setups is below 0.3 percentage points. While our selection criteria depend on the rather artificial 85% cut-off point, this can vary from project to project as each has its idiosyncratic resource constraints (and, therefore, different tolerance for lower recall). Finally, we note that the larger the ensemble, the more computation time it will take to get the necessary results. This means that any incremental increase in precision for recall might not be a worthy trade-off if the underlying infrastructure is costly or slow.

How to use human annotators?
One of the most important questions for hybrid workflow solutions is how to allocate human working hours optimally. Here, we once more emphasize the problem of the "human touch": is it best to use human annotators as coders or as validators, and in what proportion? As demonstrated by Loftis and Mortensen (2020), coders reliably performed better when validating rather than labeling virgin documents. Following this work, we also contacted a national CAP research group about their experience. Based on discussions with expert coders who supervise coding teams using a hybrid workflow, moving annotators from coding to validating tasks cut the hours needed in half. It is important to emphasize that these results are dependent on the expertise of the annotator. Another important insight is that beginner-level annotators are more likely to agree with the SML classification during validation, so there is a risk that false positives slip through the validation process. Having said that, the speed-up of the process is in part due to the binary decisions that validators have to make in each round: instead of deciding if a certain article belongs to one of 21 categories, they only have to decide if the label assigned by the SML is correct or not. Our conclusion from the above results and this qualitative information is that the same allotment of annotator-hours yields better results if split between the creation of the training set and validation. But the application of the "human touch" also leads to different results when used as in-process validation vs. post-process validation. On the one hand, introducing human validation between classification rounds contributes to training set expansion and, even in the worst case of no new true positives, identified false positives could be held back from being added to the training set.
On the other hand, all things kept equal, this would lead to an increase in the validation burden for the overall process or sap resources dedicated to ex-post validation (and would, therefore, lead to reducing the size of feasible projects). In the rest of the article, we examine the opportunities opened up by the inclusion of human validation between classification steps and introduce a number of indicators for measuring the cost-benefit performance of the way we make use of human labor. In short, we will be able to provide a more substantial answer to the question of optimizing the allocation of human working hours in our process.

Using alternative setups to gauge annotator efficiency
In this section, we compare four additional workflow setups (beyond the core SML setup discussed previously), all of which include sampling-based human validation steps between classification rounds. We will refer to the above core SML setup as the "machine only" baseline since it only relies on automated between-rounds training set expansion and no actual validation. That is, the newly added items to the training set could be either true positives or false positives, and gold-standard human validation is only applied ex-post. Figure 5 provides a summary overview of both the elements shared among all versions and the settings specific to each of the different setups. The employed classifying machine learning solution is the same for each setup, the already described SVM, and the one-vs-all binary decomposition-based ensemble (three samples times three models). Furthermore, to better capture the effect of the different workflow setups, we conducted five rounds of iterative classification and training set expansion.
Whereas the "machine only" setup, as in the measurements before, employs automatic intra-corpus snowballing, adding all newly classified texts to the training set for the next round of classification, the four other setups incorporate an element of human validation between each round of classification. For human validation, random samples are drawn from each class based on the number of classified texts. For classes with fewer than a hundred classified texts, all are selected. For classes with a higher number of classified elements, sample size is determined based on binomial (representing correctly and incorrectly classified texts) sampling rules so that the validation results allow for a ±5% margin of error at a 95% confidence level.
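The sampling rule can be illustrated with a short sketch. The exact formula used in the workflow is not stated here, so the code below assumes the standard sample size formula for a proportion with a finite population correction, a worst-case proportion of 0.5, and z = 1.96 for 95% confidence. The function name and its parameters are hypothetical.

```python
import math


def validation_sample_size(n_classified, margin=0.05, z=1.96, p=0.5,
                           take_all_below=100):
    """Number of texts to draw from one class for human validation.

    Assumption: standard binomial sample size formula with finite
    population correction; the paper does not spell out its formula.
    """
    # Small classes are validated in full, as described in the text.
    if n_classified < take_all_below:
        return n_classified
    # Sample size for an infinite population (about 384 for these defaults).
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction for a class of n_classified texts.
    n = n0 / (1 + (n0 - 1) / n_classified)
    return min(n_classified, math.ceil(n))


for size in (50, 1000, 100000):
    print(size, validation_sample_size(size))
```

Under these assumptions the required sample never exceeds 385 texts per class, however large the class becomes, so the share of texts that must be validated shrinks as the class grows.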
The four setups incorporating human validation build on each other in the following manner. The first version, titled "basic", adds only the identified true positives from the validation sample to the training set during the first two rounds. From round three onward, it allows for snowballing, that is, also adding non-validated non-sample elements to the training set, for major topics where the validation sample precision reaches at least 75% (the threshold we use as the standard inter-coder reliability for human coders).
The "enhanced basic" version also takes advantage of the additional information gained about identified false positives through the sample validation process. Once a document has been identified as having been falsely classified as belonging to some class, it will no longer be included in the test set when checking for that given class. Rather it will be included in the training set as a known negative for that class.
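A minimal sketch of this bookkeeping for a single one-vs-all classifier follows. The data structures (a dictionary of labeled documents, a set of unlabeled document ids, and a per-class set of known negatives) and the function name are illustrative assumptions, not the implementation used in the study.

```python
def build_sets_for_class(label, labeled, unlabeled, known_negatives):
    """Assemble training and test sets for one binary (one-vs-all) task.

    labeled:          dict mapping doc id -> confirmed major topic label
    unlabeled:        set of doc ids that are still unclassified
    known_negatives:  dict mapping label -> set of doc ids that human
                      validators rejected for that label
    """
    rejected = known_negatives.get(label, set())
    positives = {doc for doc, lab in labeled.items() if lab == label}
    # Rejected documents become negative training examples for this class.
    negatives = {doc for doc, lab in labeled.items() if lab != label} | rejected
    # They are no longer tested for this class, but stay in the test set
    # of every other class.
    test = unlabeled - rejected
    return positives, negatives, test


labeled = {"d1": "Health", "d2": "Defense"}
unlabeled = {"d3", "d4"}
known_negatives = {"Health": {"d3"}}
print(build_sets_for_class("Health", labeled, unlabeled, known_negatives))
print(build_sets_for_class("Defense", labeled, unlabeled, known_negatives))
```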
Third, the "stepwise balanced" setup is motivated by our goal of better optimizing the allocation of human labor in the process. Depending on the training and test set, some classes might reach very high levels of precision, even starting from the first round of classification. It makes no sense to keep validating samples for these classes, as opposed to snowballing them into the training set straight away. In this setup, we start out with the possibility of snowballing enabled from the very first round, but we employ a conservative stepwise approach to lowering the precision threshold for intra-corpus snowballing: 85% in the first round, then 80% in the next round, and finally the previously used 75% baseline from round three onward.
Finally, the "stepwise balanced plus active learning" setup adds an active learning component to deal with documents that are singled out by our classifying ensemble as being highly characteristic of multiple (most often two, sometimes three) classes at the same time. This is active learning in two senses. On the one hand, the elements to be queried are selected because they are evidently difficult for our classifying ensemble to handle: they are the elements that are assigned multiple verdicts (these items have so far been disregarded in all of the previous setups, in line with the description of our ensemble above). On the other hand, this is true active learning in the sense that these elements are not validated for the assigned multiple classes but are rather definitively classified by human experts. Figure 6 offers a summary overview of the precision and recall results for the five setups.
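The snowballing rules of the "basic" and "stepwise balanced" setups, and the routing of multi-verdict documents to human experts, can be summarized in a short sketch. All function names and return labels are hypothetical. The sketch also assumes that the set of verdicts for a document contains every class for which the ensemble voted unanimously.

```python
def snowball_threshold(setup, round_no):
    """Minimum validation-sample precision required for snowballing.

    Returns None when snowballing is disabled in the given round.
    """
    if setup == "basic":
        # No snowballing in rounds one and two, 75% from round three onward.
        return None if round_no < 3 else 0.75
    if setup == "stepwise balanced":
        # 85% in round one, 80% in round two, 75% afterwards.
        return {1: 0.85, 2: 0.80}.get(round_no, 0.75)
    raise ValueError(f"unknown setup: {setup}")


def may_snowball(setup, round_no, sample_precision):
    """True if a class may be snowballed without full validation."""
    threshold = snowball_threshold(setup, round_no)
    return threshold is not None and sample_precision >= threshold


def route(verdicts):
    """Decide what happens to a document given its set of class verdicts."""
    if len(verdicts) == 0:
        return "stays_unclassified"
    if len(verdicts) == 1:
        return "single_verdict"
    # Two or three verdicts: a human expert assigns the definitive label.
    return "human_oracle"


print(may_snowball("stepwise balanced", 1, 0.90))
print(route({"Health", "Macroeconomics"}))
```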

An evaluation of alternative setups: precision
While precision values are dominated by the results for the "basic" and "enhanced basic" setups, the "stepwise balanced plus active learning" setup produces the highest recall values for all major topics.
To understand these results in the context of what is happening in the five setups, let us turn to the discussion of precision first. The detailed numbers in Table 3 confirm that the precision values for the "basic" and the "enhanced basic" setups are indeed the highest, with the majority of major topic codes standing at a 100% precision rate. This, however, is due to almost all accepted classified elements undergoing human validation, with very little actual snowballing taking place in these two setups. The "stepwise balanced" and "stepwise balanced + active learning" setups, where snowballing is more dominant, perform similarly well, with only 6 major topics below 95% precision and all of the topics above 92%. Compared to the "machine only" baseline, the hybrid workflow shows remarkable improvements across all specifications in terms of precision.
Adding an element of active learning to the classification rounds in the fourth setup not only increases the total precision of our results but also has a clear net benefit on all three of our cost-benefit indicators, as can be seen in Table 5 (with the added cost of the oracle function performed by the human experts on the articles selected for active learning; we return to this issue below). A careful analysis of our results, however, points towards an unexpected possible conclusion, namely that the articles selected for active learning should not be added to the training set. Examining Table 3 again, we can see that some major topic classes (Macroeconomics, Civil Rights, Defense, and Other) do suffer a slight drop in precision.

An evaluation of alternative setups: recall
The recall by major topic category is much more heterogeneous than the precision, as Fig. 6 and Table 4 show. The differences between the "machine only" and the various hybrid setups are not as pronounced as was the case for the precision results. There are 7 categories above 60% recall in the "stepwise balanced + AL" setup and 6 categories in the middle-of-the-road "stepwise balanced" setup. Compared to the precision, the recall results (both total and by topic) are significantly lower. Nevertheless, these values still represent considerable savings in human coding for a text classification project, given the high precision of the workflow. Furthermore, we can see in Table 4 that the total recall score increases by 2.7 percentage points because of activating the active learning component. Adding 436 correctly classified articles to our set of newly classified articles should, ceteris paribus, increase our total recall score by 4.6 percentage points for an average unclassified corpus set size of 9516. This seems to indicate that while adding the correctly classified active learning elements does indeed have a gross positive effect on all our total indicators, the total change covers up a drop in performance for the results if we were to exclude the active learning elements. The most likely explanation for this phenomenon lies in the way the text of the articles selected for active learning is characteristic of more than one major topic. For example, an article on the privatization of hospitals would be hard to classify even for human experts choosing between "macroeconomics" and "health" based on the emphases and slight nuances of the given text. By adding these types of articles to the training set, we are, in a way, lowering the quality of classification. Therefore, going forward, we would recommend still selecting out elements with multiple verdicts for human consideration. 
But the aim here would not be to add them to the training set per se, only to the final set of newly classified elements at the end of all classification rounds. This would, of course, change the meaning of this component: no learning takes place, as the information is not added to the training set. The reason we would nevertheless recommend selecting these elements out for human experts to check against the ensemble verdicts is that this type of decision between two (in rare cases three) major topic classes, like validation, is again potentially much easier than full multi-class classification.
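The gap between the expected and the observed change in total recall can be retraced with a few lines of arithmetic. The sketch assumes, as the text does, that the recall denominator can be approximated by the average test set size of 9516. The variable names are illustrative.

```python
TEST_SET_SIZE = 9516      # average number of documents to be classified
AL_CORRECT = 436          # articles correctly labeled via active learning
OBSERVED_GAIN_PP = 2.7    # observed change in total recall (Table 4)

# Gain expected if nothing else changed (ceteris paribus).
expected_gain_pp = 100 * AL_CORRECT / TEST_SET_SIZE

# The shortfall is the implied change in recall among the documents
# that were not handled by the active learning component.
implied_change_pp = OBSERVED_GAIN_PP - expected_gain_pp

print(f"expected gain:  {expected_gain_pp:.1f} pp")
print(f"implied change: {implied_change_pp:.1f} pp")
```

On these figures, the recall of the purely ensemble-classified part of the corpus would have fallen by roughly 1.9 percentage points, which is the drop that the gross improvement covers up.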

Validation cost and snowballing gain
To better appreciate the costs involved in gaining additional precision, we introduce three indicators (see Table 5). The "total snowballing gain" (or automated training set expansion gain) is the number of all non-validated articles that are classified during the process. The "net snowballing gain" is the number of correctly classified texts among the total. Third, the "total validation cost" is the number of classified articles that were checked during the human validation phases of the classification rounds in each case. Note that "total validation cost" also includes the cost of checking false positives. We can see the drop in "total validation cost" from the "basic" to the "enhanced basic" setup, as the retention and use of information on identified false positives helps the classifying ensemble avoid making the same classification errors again. The drop in precision as we move from the "enhanced basic" to the "stepwise balanced" setup (see Table 3) is precisely the result of more snowballing happening in the latter workflow. The net snowballing gain increases from 843 documents up to 2788 in the case of the "stepwise balanced + AL" setup. The average starting test set size (the documents to be newly classified) for our simulations is 9516, which means that 29.3% of the test set can be correctly classified automatically and accepted without human validation (while still yielding the exceedingly high precision results in Table 3) using the hybrid workflow coupled with active learning. In line with these gains, the validation costs also decrease, as there are fewer documents for the human annotators to go over, due to the retention of information on false positives, snowballing starting from round one, and the separating out of articles for active learning. 
It is important to note that proportional validation costs and snowballing gain numbers do not scale linearly (due to the way the sample size is determined based on the binomial distribution), and thus the larger the corpus to classify, the greater the gains that can be reaped.
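The three indicators can be computed from a simple record of what happened to each classified document, as in the sketch below. The record layout and function names are hypothetical. The sketch also assumes that the correctness of non-validated documents is known only ex post, from the gold-standard labels.

```python
from dataclasses import dataclass


@dataclass
class ClassifiedDoc:
    validated: bool  # checked by a human during an in-process validation phase
    correct: bool    # ensemble label matches the gold-standard label


def cost_benefit(docs):
    """Return the three cost-benefit indicators for a list of documents."""
    snowballed = [d for d in docs if not d.validated]
    return {
        "total_snowballing_gain": len(snowballed),
        "net_snowballing_gain": sum(d.correct for d in snowballed),
        # Includes validated false positives, which also cost annotator time.
        "total_validation_cost": sum(d.validated for d in docs),
    }


def net_gain_share(net_gain, test_set_size):
    """Net snowballing gain as a percentage of the starting test set."""
    return 100 * net_gain / test_set_size


docs = [
    ClassifiedDoc(validated=False, correct=True),
    ClassifiedDoc(validated=False, correct=False),
    ClassifiedDoc(validated=True, correct=True),
    ClassifiedDoc(validated=True, correct=False),
]
print(cost_benefit(docs))
print(round(net_gain_share(2788, 9516), 1))
```

The last line retraces the share reported above for the "stepwise balanced + AL" setup from the net snowballing gain of 2788 and the average test set size of 9516.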
Furthermore, as indicated above, in the cases where human in-process validation is introduced into the hybrid setup, snowballing becomes tied to the human validation of samples. Therefore, hard-to-classify major topics (that is, those that underperform in terms of precision) will not suffer from the problem of automatically adding large portions of false positives into the training set, as is the case in the "machine only" setup. This effect is obvious in the way the worst-performing major topics for precision in the "machine only" version (Labor, Housing, and Public Lands) all have perfect precision scores in the setups involving human validation.
The introduction of human validation of samples helps to differentiate between the easy and hard to classify major topics. These benefits, of course, will vary from corpus to corpus. But based on this data, our expectation is that in-process validation will lead to savings on both human training set creation and overall (in-process and ex-post) validation work, while at the same time ensuring that texts belonging to problematic major topics are treated with due diligence and are only accepted after human validation.

The application of hybrid classification with limited computational capacity
So far, we have discussed our results from the viewpoint of the optimal allocation of human labor. Nevertheless, our results also provide useful information for use cases where computational capacity is expensive or limited. In Table 6, we summarize the total number of single verdicts assigned by the classifying ensemble each round for the different setups. It is important to remember that, depending on the setup used, not all single verdicts end up becoming newly classified elements moved to the training set. What this table demonstrates is the speed of convergence with which the workflow reaches a state where the classifying ensemble no longer assigns verdicts to the still unclassified elements in a significant quantity (and for the very last steps, the precision of those results also drops too low for the process to be worth pursuing any further).
In our measurements (based on the setup discussed in the section baseline simulations), the workflow consisted of three rounds, and the acceptability of this cut-off point is further buttressed by the measurements for the current group of simulations. Except for the "basic" version, all other setups reach their convergence point around the third to fourth round, depending on the cut-off criteria. This is a very important result for projects with a bottleneck in computational capacity. Three rounds of the hybrid workflow using the "stepwise balanced" setup, for example, are sufficient for reaping the benefits of the machine classification process within a hybrid approach.

Conclusion
The goals of this article were twofold. First, our aim was to investigate the external validity of the hybrid workflow approach to classify a large body of news documents. Second, we set out to investigate if such a hybrid ensemble setup can be used in a production setup to save on human annotating capacity by placing the human touch where it is most needed. In our research design we considered four algorithms based on their prevalence in the literature and our initial test results: Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) and LASSO. We proceeded with NB and SVM due to the fact that (1) NB is widely used in the literature (regardless of its actual performance) and (2) SVM looked to offer better results than LASSO.
As for the first goal, we showed that both the NB and SVM classifiers perform better than the base NB approach on the New York Times corpus. To have the best possible comparison, we replicated the SVM workflow with our English language corpus and found that the workflow yielded results comparable to those reported by Sebők and Kacsuk (2021) and exceeding those reported by Loftis and Mortensen (2020). These results also highlight and reinforce the superior performance of the SVM method compared to Naïve Bayes and the (so far often overlooked) power of using ensemble SML approaches.
Second, we provided evidence that such a hybrid ensemble setup can be used in a production setup to achieve considerable savings in human annotating needs. The four variations on the workflow using different degrees and types of human-machine interaction showed that carefully balancing human and SML efforts can maintain high precision and also save costs. In our simulations the average starting test set size surpassed 9000, and on a corpus of this size we could generate valid machine codes, with no human involvement, for almost 30% of the test set without any meaningful sacrifice to high precision.
This allows research projects to shift resources to address production bottlenecks or complete projects ahead of schedule. Another key advantage of the modular structure is that if an in-process human validation element is incorporated (as we demonstrated above) it provides the best of both worlds: the snowballing process can use categories where the SML classification is very accurate, and human validation can be applied to categories where the SML classifier struggles.
We believe that the methods demonstrated above showcase a robust and modular approach that can be applied in many text classification contexts (not limited to the CAP codebook). Further testing of the workflow might include using the hybrid workflow with different coding schemes and across domains (although one cross-domain application, language, is already underway as we presented our data in comparison to other projects).
While the results show that the hybrid workflow and the SVM voting ensemble perform exceptionally well with unbalanced classification tasks, further robustness checks might look into applying it to different coding frameworks and see how well these results travel across research domains. Similarly important is the fact that adopting this workflow approach can speed up large text classification projects considerably, and cost savings allow projects to use ML as an efficiency booster in planning and implementation.