Introduction

To study online media content, researchers use text classification methods to analyze large volumes of text data [1]. Text classifiers based on supervised machine learning can be adapted to new classes and texts without modifying the algorithm, requiring only an annotated training dataset [2]. However, such training datasets are often unavailable for the class or topic of interest, so a custom dataset must be manually annotated.

When texts are annotated, the resulting classifier’s accuracy should increase with every additional text sample [3]. However, each additional text statistically increases the accuracy less than the previously added text because of the asymptotic shape of the learning curve [4]. Annotating more texts therefore decreases annotation efficiency. To minimize human annotation effort, an optimally sized training set that provides the best trade-off between classification accuracy and manual effort should be annotated, and the text classifier with the highest expected accuracy should be selected.

A major problem is that experimentally evaluating a multitude of text classifier designs and selecting the one with the highest accuracy results in overfitting of the selected classifier [5]. Consequently, both the training set size and the text classifier design should be pre-determined based on empirically tested guidelines to avoid the need for model selection, which would require further annotation effort to create out-of-sample test data for estimating unbiased classification accuracy [6]. The objective of this work is to empirically derive a baseline recommendation that helps practitioners and researchers design text classifiers that are as accurate as possible given very limited resources for creating training datasets.

Previous work has concentrated on optimizing text classifiers for large datasets with more than 1000 texts or on estimating accuracies for classifiers on a single small dataset. Neither approach generalizes to other small datasets: the former does not take the effect of the training set size on the chosen text classifier into account, and the latter may suffer from random errors in accuracy estimates due to the small training and test set size. Few studies have varied the training set size and estimated its effect on classifier accuracy. These studies focused on comparing Support Vector Machines (SVMs), Naïve Bayes (NB) and other machine learning algorithms and did not evaluate the design factors of the feature vector. Moreover, they do not indicate the dataset size that is required to train an accurate text classifier. Previous work on learning curves has provided methods to estimate the shape of the learning curve by fitting an inverse power law model to accuracy estimates. These approaches require that such a dataset is already available and are therefore not useful for researchers who intend to analyze new data.

We contribute by proposing a guideline for online media researchers and practitioners for designing text classifiers and efficiently creating custom datasets. We select a baseline classifier design based on empirical experiments using a four-way full-factorial, repeated-measures design with 32 design factor combinations and 22 training set sizes. Furthermore, we quantify the effect of training set size on classifier accuracy. We find that a small dataset of 300 documents already provides high accuracy and that adding more training examples rarely increases classification performance substantially.

Related Work

Dictionary-Based Approaches for Text Classification

Dictionaries might be chosen for small dataset sizes because they require no training set. The disadvantage of dictionary-based classifiers is that they are often not directly evaluated for the classification problem at hand due to the lack of a labeled dataset [7]. Improving a dictionary is difficult because of the large number of candidate words that might be added to a class, and classifiers using dictionaries usually achieve lower accuracies [8]. A reason for the low accuracy of dictionary-based approaches is that dictionaries are either developed for broad application (e.g., General Inquirer [9]), achieving only rather low classification accuracy in a specific domain, or are specialized dictionaries (e.g., those designed by Henry [10] or Loughran and McDonald [11]) that can be applied only in a specific context. Furthermore, dictionary-based classifiers weight each word in the dictionary equally because information regarding the importance of dictionary words is missing.

Effects of Training Set Sizes on Design Factors

A large body of literature has improved the accuracy of text classifiers on large datasets, which typically contain over 2000 examples [12,13,14]. However, the training set size can play an important role in choosing the optimal text classifier configuration [4]. Selecting a classifier because it performs well on large training sets neglects the effect of the training set size on the classifier’s performance, which might result in selecting the wrong classifier for a small training set [4].

Similarly, several studies evaluate text classifiers on smaller domain datasets [15], although accuracy estimates on small datasets suffer from random errors, and results of individual experiments on one dataset might not generalize to other datasets. One classifier might therefore outperform another just by chance, which can lead to contradictory results. Additionally, such studies do not allow estimating how large the effect of a specific training sample is on the accuracy, which is important for estimating the number of training samples that need to be annotated.

Furthermore, there are pre-trained deep learning models like BERT [16] and ULMFiT [17] that can be fine-tuned to small datasets using transfer learning. Previous work has shown that these approaches can achieve higher accuracy than SVM text classifiers for small datasets [18]. However, these approaches require more computational resources for training and application. For example, Usherwood et al. [18] state that BERT-base requires 12 GB of VRAM. Furthermore, inference with BERT-base is slower than with SVMs [19]. Therefore, substantial computational resources would be required if BERT-base were used to analyze large online media datasets [18].

Previous work on text classification has studied feature selection for small datasets by comparing feature selection methods for SVM, k-nearest neighbors (KNN) and NB classifiers on ten datasets [20]. Although this work helps to identify better feature selection methods for small datasets, classifiers were only compared on a fixed training sample of 1000 documents. Therefore, this previous work provides no information on the effect of training set size on selecting a text classifier for smaller datasets.

Furthermore, the effect of training set size on SVM and NB has been compared on a Twitter dataset consisting of 4269 training and 782 test examples [21]. This study compares training set sizes from 10% (427 examples) to 100% (4269 examples) in steps of 10%. The training set was not resampled; therefore, the standard deviation of the accuracy is not available and random errors might affect the reported accuracy estimates. Furthermore, this study includes only one dataset consisting of tweets, and results on datasets with longer documents might vary.

Another study compared SVM, Multinomial NB and Decision Trees for training set sizes of 50–500 examples in steps of 50 examples [22]. The classifiers were compared on four sentiment datasets. However, the study uses unigrams as text representation and term frequency as term weighting, and does not experiment with different term weightings or text representations.

The related work highlights an important gap in the literature, i.e., identifying the optimal design factors for text classifiers trained on small training sets. The model selection should be conducted on several large-scale datasets to support the generalizability of the reported design factors. Additionally, the training sets should be resampled several times to reduce random errors, and the standard deviations of the mean accuracies should be reported to identify the impact of a randomly annotated training sample on the accuracy.

Effect of Training Set Size on Accuracy

The effect of training set size on the accuracy of a text classifier can be represented by the learning curve [4]. The learning curve shows the relationship between expected performance and the number of training examples. The literature on learning curves for machine learning classifiers has highlighted that the test error is higher than the training error and that both asymptotically approach a common value with increasing training set size [4]. The learning curve may be used to estimate the sample size necessary to obtain a specific minimal performance by fitting an inverse power law model to accuracy estimates obtained from a small training set [23]. This approach supports the decision of whether more data should be annotated. However, it provides no information on the necessary training set size if no training set is available yet, because organizing the collection and annotation of a random training sample already requires a large effort. To decide whether such an effort is viable, the best configuration and classifier must be selected based on prior studies [24]. For this purpose, it is necessary that the performance of text classifiers is reported on several text classification datasets and that, if possible, a training set size is identified that achieves high accuracy.
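As an illustration of the extrapolation approach described above, the following sketch fits an inverse power law learning curve to a handful of accuracy estimates. The parameterization, the SciPy-based implementation and the data points are illustrative assumptions on our part and are not taken from the cited studies.

```python
# Sketch of fitting an inverse power law learning curve to accuracy estimates,
# as in the extrapolation approach described above [23]. The parameterization
# acc(n) = a - b * n^(-c) and the data points below are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def inverse_power_law(n, a, b, c):
    return a - b * n ** (-c)

sizes = np.array([50, 100, 150, 200, 300])               # annotated training set sizes
accuracies = np.array([0.71, 0.76, 0.79, 0.81, 0.83])    # measured accuracies (illustrative)

params, _ = curve_fit(inverse_power_law, sizes, accuracies, p0=[0.9, 1.0, 0.5])
print(inverse_power_law(1000, *params))  # extrapolated accuracy for 1000 examples
```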

Research Model

Figure 1 shows an overview of our research model. The dependent variable is classification performance measured by accuracy. Accuracy is the percentage of all documents in the test set that were classified with the class that matches the annotation of a human annotator [25]. The independent variables are grouped into design factors and training set size, which are described in the following.

Fig. 1 Research model

The main task of text classification is based on a set of documents \(D = \{d^{(1)}, \ldots, d^{(n)}\}\) and a set of classes \(Y = \{1, \ldots, m\}\), whereby a given document \(d^{(i)} \in D\) is assigned a label \(y^{(i)} \in Y\) [13, 26]. Generally, a class can be any conceptual entity and the number of classes could thus be arbitrarily high. However, no more than 20 classes were used in most previous datasets [14, 27].

The feature vector and the machine learning algorithm are our main design factors. The feature vectors are the input for the machine learning classifier and are constructed in the following text classification pipeline. First, the input document \(d^{(i)}\) is converted into the text representation, where n-grams (i.e., words or word sequences) contained in documents of the training set establish the dimensions \(x_j\) of the feature vector \(x^{(i)}\). Second, the applied feature weighting approach calculates the values \(x_j^{(i)}\) in the feature vector. Third, the feature vector \(x^{(i)}\) is used as the input for the machine learning algorithm that estimates the label \(y^{(i)}\). Prior to its application, the machine learning algorithm is trained on a training set composed of human-annotated examples. In the following we describe the three design factors that have been evaluated on the ACL IMDB dataset in our previous work [28].
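To make this pipeline concrete, the following sketch wires the three design factors together using the scikit-learn library, which we also use for our experiments; the example documents and labels are hypothetical placeholders.

```python
# Minimal sketch of the text classification pipeline described above, using
# scikit-learn. The example documents and labels are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train_docs = ["the new Spielberg film is all good", "the plot was dull and predictable"]
train_labels = [1, 0]  # 1 = positive, 0 = negative

pipeline = Pipeline([
    ("ngrams", CountVectorizer(ngram_range=(1, 2))),  # text representation: uni- and bigrams
    ("weighting", TfidfTransformer(norm="l2")),       # feature weighting
    ("classifier", LinearSVC()),                      # machine learning algorithm
])

pipeline.fit(train_docs, train_labels)
print(pipeline.predict(["a good film"]))  # predicted label for a new document
```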

Text Representation

Text representation describes how the document is represented in the feature vector \(x^{(i)}\) for the machine learning algorithm [25]. Typically, n-grams are used with n not largely exceeding 3 [29]. However, due to the scale of our experiments, we limited this design factor to uni- and bigrams, which have shown high accuracy in previous work [12]. Unigrams: each term is a feature, regardless of its arrangement and location in the text, e.g., ['the', 'new', 'Spielberg', 'film', 'is', 'all', 'good']. Bigrams: two sequential terms are a feature, e.g., ['the-new', 'new-Spielberg', 'Spielberg-film', …]. In our experiments we compare unigram features to a combination of uni- and bigram features. We expect that adding bigrams to unigram features will increase the accuracy [12, 30]. The argument for adding bigrams to the feature vector is that bigrams allow phrases to be represented in the feature vector [28].
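The two representations can be illustrated with scikit-learn's CountVectorizer, as sketched below; the example sentence is taken from the illustration above, and the feature-name call assumes scikit-learn ≥ 1.0.

```python
# Sketch of the two text representations compared in our experiments, using
# scikit-learn's CountVectorizer (get_feature_names_out requires scikit-learn >= 1.0).
from sklearn.feature_extraction.text import CountVectorizer

doc = ["the new Spielberg film is all good"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(doc)
uni_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(doc)

print(unigrams.get_feature_names_out())     # unigram features, e.g., 'film', 'good', ...
print(uni_bigrams.get_feature_names_out())  # additionally 'new spielberg', 'spielberg film', ...
```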

Feature Weighting

This design factor defines the values in the feature vector. A three-letter code allows for convenient reference to each feature weighting combination, e.g., ntn, bnn [31, 32]. Table 1 defines the formulas denoted by each letter of the code [28]. The baseline option of each component is indicated by n. For instance, nnn denotes the absolute term frequency (tf), whereas ntc denotes the term frequency–inverse document frequency (TF–IDF) normalized to unit length with L2 normalization. The L2 normalization is calculated on the complete feature vector after the other feature weighting operations have been applied. Accordingly, \(\mathrm{ntn}\) is calculated as follows, with N being the total number of documents and df being the number of documents that contain the feature:

Table 1 Feature weighting components
$$\mathrm{ntn} = tf \times \log\frac{N}{df} \times 1$$

For the term frequency component, the binary representation (bxx) of features in the document has increased performance compared to absolute term frequency (nxx) for sentiment classification [33, 34]. A possible explanation is that word frequency per document has only limited impact on the sentiment of a document and that the mere occurrence of features is more important [34]. The reason for applying inverse document frequency (xtx) stems from Zipf’s Law, which states that few words occur often, whereas most words occur seldom [35]. Common words do not help in discriminating documents; weighting features using IDF decreases the values of common words in the feature vector [36, 37]. The argument for L2 normalization (xxc) is based on differences in the number of words per document. Without normalization, shorter documents are represented by feature vectors with a lower L2 norm, while longer documents are represented by feature vectors with a higher L2 norm [31]. Dissimilar vector lengths potentially reduce classification accuracy because documents with similar content but different lengths are represented differently. Therefore, inserting a normalization factor into the weighting formula can increase accuracy [34, 38].
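In practice, the weighting codes can be approximated with scikit-learn's TfidfVectorizer, as sketched below. Note that scikit-learn's IDF variant adds an offset term, so the resulting values only approximate the formulas in Table 1, and the mapping function is our own.

```python
# Sketch of approximating the three-letter weighting codes with scikit-learn's
# TfidfVectorizer. scikit-learn's IDF adds an offset (+1), so the values only
# approximate the formulas in Table 1; the helper function is our own.
from sklearn.feature_extraction.text import TfidfVectorizer

def make_vectorizer(code, ngram_range=(1, 2)):
    """code is a three-letter weighting string such as 'btc', 'ntc' or 'nnc'."""
    tf_code, idf_code, norm_code = code
    return TfidfVectorizer(
        binary=(tf_code == "b"),                  # b: binary term frequency, n: raw counts
        use_idf=(idf_code == "t"),                # t: apply inverse document frequency
        smooth_idf=False,
        norm="l2" if norm_code == "c" else None,  # c: L2 (cosine) normalization
        ngram_range=ngram_range,
    )

btc = make_vectorizer("btc")  # combination recommended for small training sets
ntc = make_vectorizer("ntc")  # classic TF-IDF with L2 normalization
```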

Machine Learning Algorithms

Machine learning algorithms use annotated training data to learn a classification model that is then applied to unseen input documents. Linear-kernel Support Vector Machines (SVMs [39]) are frequently used and achieve high performance on text classification tasks with larger documents [12, 13]. An NBSVM is an SVM that uses Naïve Bayes (NB [40]) features and has been shown to achieve higher accuracies than a plain SVM [12].
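For a binary task, the NBSVM idea can be sketched as follows, following our reading of Wang and Manning [12]: features are scaled by the Naïve Bayes log-count ratio before a linear SVM is trained, and the learned weights are interpolated with their mean magnitude. Variable names and simplifications in this sketch are our own.

```python
# Minimal sketch of NBSVM for a binary task, following our reading of Wang and
# Manning [12]: scale features by the NB log-count ratio, train a linear SVM,
# then interpolate the learned weights with their mean magnitude (beta = 0.25).
# X is assumed to be a dense binary document-term matrix, y a NumPy array in {0, 1}.
import numpy as np
from sklearn.svm import LinearSVC

def nb_log_count_ratio(X, y, alpha=1.0):
    p = alpha + X[y == 1].sum(axis=0)
    q = alpha + X[y == 0].sum(axis=0)
    return np.log((p / p.sum()) / (q / q.sum()))

def fit_nbsvm(X, y, beta=0.25, C=1.0):
    r = nb_log_count_ratio(X, y)
    svm = LinearSVC(C=C).fit(X * r, y)      # SVM trained on NB-scaled features
    w = svm.coef_.ravel()
    w_interpolated = (1 - beta) * np.abs(w).mean() + beta * w
    return r, w_interpolated, svm.intercept_[0]

def predict_nbsvm(X, r, w, b):
    return ((X * r) @ w + b > 0).astype(int)
```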

Training Set Size

Additional training examples increase the expected accuracy of the machine learning classifier. However, statistically, each additional example increases the accuracy less than the previous example because of the asymptotic shape of the learning curve [4]. Therefore, for each dataset, we generated training set sizes of 50–1000 examples in intervals of 50 examples. Furthermore, we added training sets with 2000 and 10,000 examples for reference. We were mostly interested in the smaller training set sizes, but from a practical perspective intervals of 50 examples seem sufficient, because annotating 50 documents can be achieved in a reasonable time frame. Each of these training sets was stratified, i.e., the same relative number of documents per class was present in the training set.
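A sketch of how such stratified subsets of a fixed size can be drawn from a dataset's training pool is shown below, assuming scikit-learn; the helper function and its arguments are illustrative.

```python
# Sketch of drawing a stratified training subset of a fixed size from a
# dataset's full training pool, assuming scikit-learn. Repeated draws with
# different seeds yield resampled training sets; class proportions are preserved.
from sklearn.model_selection import train_test_split

def draw_training_subset(docs, labels, size, seed):
    subset_docs, _, subset_labels, _ = train_test_split(
        docs, labels,
        train_size=size,      # e.g., 50, 100, ..., 1000, 2000, 10000
        stratify=labels,      # keep the same relative class frequencies
        random_state=seed,    # vary the seed across repetitions
    )
    return subset_docs, subset_labels
```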

Method

Experimental Setup

The experimental setup is described in the following. Our experiment had a four-way factorial repeated-measures design with accuracy as the dependent variable. Treatments were obtained by combining 22 training set sizes and 32 design factor combinations, i.e., 2 text representations, 8 feature weightings and 2 machine learning algorithms. The 22 × 2 × 8 × 2 factorial experiment allowed us to compare results for a total of 704 treatment conditions for each of the 7 datasets, which resulted in 4928 combinations in total (design factors × training sets × datasets). Each of these 4928 combinations was repeated 20 times with resampled training sets, resulting in 98,560 experimental runs. The 20 repetitions per combination were used to reduce random errors and to obtain a standard deviation for each combination. In the machine learning experiments, the hold-out method was used to evaluate the performance of the classifiers. The training and test sets required for the hold-out method were generated as follows: the resampled training sets used in our study were drawn from the training set of the respective dataset, and the test set for estimating the accuracy was always the full test set of each dataset [14, 27].
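For illustration, the full factorial design can be enumerated as sketched below; the concrete factor levels listed are our reading of the design described above.

```python
# Sketch of enumerating the factorial design (22 training set sizes x 2 text
# representations x 8 feature weightings x 2 algorithms). The listed factor
# levels reflect our reading of the design described in the text.
from itertools import product

training_set_sizes = list(range(50, 1001, 50)) + [2000, 10000]                 # 22 sizes
text_representations = ["unigrams", "uni+bigrams"]                             # 2 levels
feature_weightings = ["nnn", "nnc", "ntn", "ntc", "bnn", "bnc", "btn", "btc"]  # 8 levels
algorithms = ["SVM", "NBSVM"]                                                  # 2 levels

conditions = list(product(training_set_sizes, text_representations,
                          feature_weightings, algorithms))
assert len(conditions) == 704  # treatment conditions per dataset (x 7 datasets, x 20 repetitions)
```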

Datasets

The datasets used in our study are described in Table 2. The table shows the different domains and problem types of the online media datasets as well as their test and training set sizes. The first dataset is the ACL IMDB dataset by Maas et al. [27]. The other datasets are large-scale datasets created by Zhang et al. [14] for an evaluation of deep learning algorithms, which require large training sets. The datasets analyzed in this study are publicly available in online repositories.

Table 2 Datasets

Table 2 shows that the datasets vary in their number of classes, domains, and classification types. All datasets contain a large number of training examples and were labeled automatically or by the authors of the documents; e.g., the ACL IMDB dataset derives polarity labels by linearly mapping the 10-star rating to sentiment polarity [27]. Manually annotated datasets are generally much smaller due to their expensive annotation process.

The datasets were used as provided in the original studies. In particular, we applied no cleansing procedures (e.g., no stop word removal) and no preprocessing (e.g., no stemming or lemmatization), because the results of such cleaning procedures are context specific and may reduce the generalizability of our results for some classification tasks. We did not apply feature selection because it comes with various hyperparameters, which are out of scope for this study. Furthermore, the effect of feature selection on accuracy is not conclusively determined when SVMs are used, due to the SVM’s built-in regularization term, which has a similar effect as feature selection [37, 41].

Machine Learning Configuration

We applied the machine learning algorithms with their default configurations as provided in the scikit-learn machine learning library [42]. The SVM implementation in scikit-learn uses LIBLINEAR, a publicly available SVM implementation [43]. We applied the default configuration (L2-regularized, L2-loss dual-form SVM with linear kernel, penalty C = 1, and margin of tolerance ε = 0.01). Similarly, we used the default configuration of the NBSVM algorithm (with β = 0.25) [12]. We used the hold-out method with the complete test set of each dataset to calculate all accuracy estimates [44].
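Expressed with scikit-learn's LinearSVC wrapper around LIBLINEAR, this configuration corresponds roughly to the following sketch; the parameter names are scikit-learn's equivalents of the settings stated above.

```python
# Sketch of the SVM configuration stated above, expressed with scikit-learn's
# LinearSVC (which wraps LIBLINEAR). Parameter names are scikit-learn's
# equivalents of the reported settings.
from sklearn.svm import LinearSVC

svm = LinearSVC(
    penalty="l2",          # L2 regularization
    loss="squared_hinge",  # L2 loss
    dual=True,             # dual-form solver
    C=1.0,                 # penalty parameter
    tol=0.01,              # margin of tolerance (epsilon)
)
# For NBSVM, the same linear SVM is trained on NB-scaled features with beta = 0.25 [12].
```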

Results

Model Selection

In the following, we analyze how different design factors affect the accuracy for both small and large training set sizes. We defined the following groups of training set sizes: small training sets consist of 50–500 training examples, large training sets consist of 550–1000 examples and, additionally, training set sizes of 2000 and 10,000 training examples are grouped individually (see Fig. 2).

Fig. 2 Difference of mean accuracies for the design factors averaged over all datasets

Averaged over the seven datasets, the effects of the design factors on accuracy were positive for all training set sizes, except for IDF at a training set size of 10,000 examples (see Fig. 2). Additionally, Fig. 2 indicates that for NBSVM, binary features and IDF, the effect on accuracy is inversely related to the training set size. Using uni- and bigrams always increased the accuracy compared to using only unigrams, and this effect increases with training set size. Furthermore, we found that applying L2 normalization to the feature weights has the largest effect on accuracy; this effect also seems to increase with more examples.

However, Fig. 2 averages the results over several datasets, and it is likely that the effects of some design factors depend not only on the training set size but also on the dataset. To check whether the effect of the training set size was stable across the seven datasets used in this study, we analyzed the results for each dataset individually for different training set sizes (Fig. 3).

Fig. 3 Difference of mean accuracy for the design factors for the datasets used in this study

Figure 3 depicts the results for different training set sizes and datasets. NBSVM increased the performance for all datasets except DBPedia. Bigrams had a negative effect on AG’s News and Yahoo! Answers but increased the performance for all other datasets; with increasing training set size, the effect for these two datasets eventually turned positive. The effect of binary features (bxx) decreased for larger training sets and was negative for AG’s News and Yahoo! Answers. L2 normalization (xxc) had a positive effect on all datasets. IDF (xtx) had a positive effect for small training set sizes; however, this positive impact degraded and became negative or non-existent with increasing training set size. For training set sizes exceeding 2000 examples, Yelp Review Full and Yelp Polarity were both negatively affected by the application of IDF term weighting. In summary, these results indicate that most factors were mainly affected by the training set size.

Figure 4 displays how the effect of each design factor changes with respect to training set size and dataset. For example, the direction of the effect of bigrams for Yahoo! Answers (top right in Fig. 4) changes at more than 2000 training examples. Similar effects can be observed for IDF (bottom left in Fig. 4).

Fig. 4 Difference of mean accuracies for the design factors. The x-axis depicts the training set size in logarithmic scale

Given the previous results, we analyze whether there are any interaction effects between the three design factors. For this purpose, we provide tables containing all factor combinations for small (Table 4), large (Table 5) and all (Table 7) training set sizes as well as for a training set size of 10,000 examples (Table 6) in the appendix of this study. These tables indicate that there are most likely no large interaction effects that would make the combination of all factors disadvantageous. Furthermore, Tables 4, 5, 6 and 7 indicate that the combination of all factors (i.e., uni- and bigrams, NBSVM, btc) achieves the highest accuracy in most cases, making further interaction effects irrelevant, because we are only interested in the most accurate factor combination. Therefore, for small datasets we select the design factor combination of uni- and bigrams, NBSVM and btc. Datasets with more than 1000 examples might use the same factor combination but with the term weighting ntc, and datasets with more than 10,000 examples could use the term weighting nnc.

Training Set Size

Figure 5 depicts the accuracy for the factor combination selected for small datasets (uni- and bigrams, btc and NBSVM). The standard deviation of the accuracy over the 20 repetitions per experiment is displayed by the error bars on each data point. Figure 5 indicates that, for the selected design factors, a dataset with more than 300 examples increases the accuracy only moderately. Note that the number of classes in a dataset does not have a large impact on the accuracy improvement, even though datasets with more classes contained fewer documents per class. For example, the DBPedia dataset consists of 14 classes and did not benefit from using more examples, whereas the Yahoo! Answers dataset, consisting of 10 classes, might benefit slightly from more examples. In both cases, however, the effect was rather small. Additionally, Fig. 5 indicates that the standard deviation for each training set size is small.

Fig. 5 Accuracy for the factor combination uni- and bigrams, btc and NBSVM. The error bars depict the standard deviation of the 20 repetitions

Table 3 provides the averaged accuracies from Fig. 5 in tabular form. The last column “all” displays the accuracy estimate of the selected model trained on the complete training set as listed in Table 2; note that this full training set size varies per dataset. Table 3 shows that for each dataset the individual training set sample had only a minor impact. With only 100 training examples, the standard deviation was below 3% for each of the 7 datasets. This result indicates that annotating only a small dataset already helps to predict the accuracy of a model trained on a larger dataset.

Table 3 Average M and standard deviation SD of the accuracy both in percent for the factor combination uni- and bigrams, btc and NBSVM

Discussion

Findings

In general, our results indicate that increasing the training set size above 300 examples has only a minor effect on a classifier’s accuracy for most datasets. Furthermore, this effect seems to hold independently of the number of annotated classes in the dataset. The findings are largely consistent across datasets, although some interaction effects with the training set size were found:

  • NBSVM improved the accuracy in most cases. Increasing the training set size reduced the positive effect of NBSVM.

  • Bigrams increased the accuracy in most experiments. The positive effect of bigrams on accuracy increased with increasing training set size. This effect can be explained by model complexity [45]. First, many of the additional features that bigrams provide can only be utilized with larger datasets, because specific bigrams are less likely to occur. Second, bigrams increase the number of features, which requires the machine learning algorithm to fit more weights. Therefore, using bigrams requires larger training sets to reduce overfitting of the additional weights.

  • Binary features (bxx) increase the accuracy for small training sets, but should have no effect for large training sets. Similar results have been found in previous work for sentiment classification [34].

  • IDF (xtx) has a positive effect on small training sets but the effect diminishes for large training sets. IDF increases the weight of terms that appear in fewer documents.

  • L2 normalization (xxc) of the feature weights had consistently a positive effect on the accuracy. These results suggest that normalization should be applied in all text classifier designs.

Limitations

The generalizability of our results is subject to limitations. First, note that our results are based only on seven datasets and the labeling was generated by the authors or automatic processes, which is not the same as manual labeling from non-author annotators. Therefore, our results might not generalize to all manually generated datasets.

Second, for each design factor we compared all experiments including the factor with all experiments not including the factor to measure the effect of that design factor on accuracy. This experimental approach implies that design factors do not interact with each other. Although this is a naïve assumption, further work could study several effective design factor combinations that we already found, such as btc, ntc and bigrams, to check for interaction effects among them.

Third, in this context another remark must be made upon our findings for the NBSVM classifier. All prior applications of NBSVM were restricted to a particular feature weighting schema (bnn) [12]. We acknowledge that the NBSVM algorithm might be tailored to this schema, which in turn could explain some of the interaction effects.

Fourth, we did not apply deep learning in our experiments, even though deep learning has recently produced breakthroughs in the fields of text understanding and conversational AI (e.g., the breakthrough results by BERT [16, 46]). Furthermore, the deep learning approach ULMFiT, which uses transfer learning to improve text classification performance, has been suggested for text classification [17]. We did not integrate ULMFiT or BERT in our experiments because, at the time of writing, the added complexity and computational resource requirements of these methods seemed impractical for many social media researchers and practitioners. Therefore, we focused on simpler but still effective approaches.

Implications

The implications of the laid-out limitations for future work are as follows. First, further work may study how much training data is necessary for transfer learning to achieve certain levels of classification accuracy, especially if BERT or ULMFiT is used. A focus should be placed on finding the right number of training epochs and fine-tuning parameters, because in a short experiment with the BERT-base model we could not achieve high accuracy for training set sizes of 100 or 300 examples on the ACL IMDB dataset using reported parameter configurations [18].

Second, further work is needed to fully understand the interactions between training set sizes and the classifier design factors explored in this work. We found some interactions between design factors and datasets (e.g., for term presence and uni- and bigrams). However, we could not pinpoint the dataset characteristics that lead to these interactions; none of the characteristics known to us (i.e., source, classification type, domain, or number of classes) seemed to be related to them. Understanding these interactions would allow keeping training sets small while increasing the classifiers’ accuracy further.

Third, as pointed out in the limitations section above, we assume linear relationships between the design factors. We believe there is potential to further improve the classification performance by further studying interactions among effective design factors such as the different feature weighting schemes.

Fourth, another study investigated the interaction between accuracy and feature selection for large datasets of 10 thousand to 1.6 million tweets and found that common feature selection algorithms can help to increase accuracy [41]. Additionally, feature selection is an approach recommended by practitioners and researchers [48]. In contrast, previous work also indicates that performance tends to increase if more features are used [47]. It would be interesting to investigate whether feature selection can improve the accuracy for small datasets using our experimental approach [37].

Fifth, the already high accuracy for more than 300 examples is consistent with previous research that used a gene expression dataset [49]. Therefore, additional research could investigate whether the high accuracy for more than 300 examples is a more general phenomenon.

Conclusion

This paper reports an experimental study, examining the design factors that affect the accuracy of machine learning text classifiers for small, manually annotated datasets. We contribute insights on how text classifiers should be designed and how training sets should be sized to both achieve high classification accuracy and to also minimize the amount of human labor required.

We observed several interaction effects between design factors, training set size and dataset, which corroborate the need for further research. However, we find that overall, the theoretical design factors for machine learning-based classifier design generalize well among different training set sizes and datasets. Online media researchers and practitioners can use this knowledge as guidance for more efficiently designing custom datasets and they can readily use our proposed baseline design factor choice for their text classifiers. Thus, researchers and practitioners can reduce human labor and increase the accuracy of their classifiers without setting up a large number of experiments.

As a baseline for classifier training on small datasets, we recommend uni- and bigram features as text representation, btc term weighting and a linear-kernel NBSVM as the machine learning algorithm. Our results suggest that a manually annotated training set may contain only 300 examples and still achieve high accuracy. Accuracy can be measured by cross-validation to avoid the need for an additional test dataset. Additionally, one might measure the performance at smaller training set sizes to get a first indication of the feasibility of the pursued classification task, because the standard deviation of the accuracy across different training samples is rather small even for small training set sizes.

Our experiments also indicate that the number of classes plays a minor role in the relationship between training set size and accuracy, which is surprising because the number of examples per class is lower for datasets with more classes, given equal training set size. However, further research is required to study the effect of the number of classes on the accuracy.