We designed a study to answer three distinct but related questions: (1) Is ML more accurate than humans at classifying scientific abstracts? (2) Is ML more reliable than humans at classifying scientific abstracts? (3) To what extent does human classification performance improve, relative to ML, through (i) increased task training, (ii) increased prior knowledge, (iii) selection on past performance, and (iv) feedback? In this study, accuracy means “classifying an abstract correctly to its true discipline group”, while reliability means “two classifiers classifying an abstract to the same discipline group, whether or not this classification is correct”. Accuracy and reliability are not necessarily related, although as a group of classifiers approaches perfect accuracy it trivially also approaches perfect reliability.
We use the abstracts of European Research Council (ERC) Starting Grant (StG) funded projects that were accepted between 2009 and 2016 inclusive. The ERC evaluation panel structure has been stable since 2008 (European Research Council 2019a). For the purpose of our study, the existing panel classifications are considered the ground truth that we do not question. ERC grant applicants initially select the panel they apply to. While panel chairs may re-assign an application to a different panel, and may consult members of other relevant panels, each application is nevertheless assigned to a single panel (European Research Council 2019b).
As we are primarily interested in the classification of natural sciences abstracts, we focus on the 19 physical and life sciences panels, comprising 2523 abstracts in total.
Table 1 presents these panel codes and titles.
Our study has four stages. In the Undergraduates stage, we recruited 63 undergraduate student assistants from Nanyang Technological University, a major research university in Singapore, for a full-day task. We sent a recruitment email for our research abstract classification study to all undergraduates via the university’s mailing lists. To screen potential classifiers for aptitude, we required applicants to complete a short example task classifying two research abstracts. Just over one hundred undergraduates responded to the email and completed the example task. Although we attempted to recruit every applicant, 63 eventually reported for work on the day of the study. The task was conducted in a classroom on campus, where the assistants were given one of four training sets of abstracts in the morning. Each training abstract was labelled with the existing ‘ground truth’ ERC evaluation panel, allowing the assistants to study how the abstracts ought to be classified. In the afternoon, they were given a test set of different abstracts, with the ERC evaluation panel labels removed, and were asked to assign each abstract to the ERC panel most likely to match the ‘ground truth’. To ensure that their performance was due to learning from their training set only, we disallowed peer discussion and internet use. To incentivise both performance and completion, the assistants were compensated with a flat rate of Singapore $120 for a day of work plus a variable amount of $4 for every 10 abstracts in their test set that they correctly classified. In this stage, the average amount paid was $163, paid in cash.
In the second stage, termed the high-performance undergraduates stage, we retained the eight undergraduates with the highest accuracy scores from each training set group. These undergraduates also had to be willing to continue with the study for up to two more stages. Over email, we gave these high-performance undergraduates another test set to classify without further training. This stage is meant to answer question (3)(iii) above about selection effects on human classification performance. After they returned the completed test sets, we began the third stage, termed the high-performance undergraduates plus feedback stage. We gave the high-performance undergraduates feedback on the actual ‘ground truth’ classifications of the abstracts they had classified in the previous two stages. Then we gave them a third test set to classify. This stage is meant to answer question (3)(iv) above on the effect of feedback on human classification performance. As before, we instructed the undergraduates to refrain from peer discussion and internet use while classifying the test sets. The undergraduates were compensated with a flat rate of $150 for completing both test sets plus a variable amount of $6 for every 25 abstracts that they correctly classified. In this stage, the average amount paid was $207, paid in cash.
In the final stage, termed the postgraduates stage, we recruited 26 Ph.D. students and postdoctoral researchers in STEM disciplines (postgraduates) from Nanyang Technological University for a half-day task in which they were given a test set to classify without any training or task exposure. We sent a recruitment email for our research abstract classification study to all postgraduates in the College of Engineering with the assistance of the administrators of each School in the College. Nearly half of the postgraduates were from the engineering sciences. The task was conducted in a classroom on campus, and as with the undergraduates, peer discussion and internet use were not allowed. The postgraduates were compensated with a flat rate of $80 for completing the test set plus a variable amount of $10 for every 50 abstracts that they correctly classified, up to a total of $120. The average amount paid was $98, paid in the form of $10 vouchers (rounded up) for a large national supermarket. This stage is meant to answer question (3)(ii) above on the effect of prior knowledge on human classification performance. Additional details on the study participants are in the “Appendix”.
Test and training sets
Test and training sets were generated by stratified random sampling, with an equal number of abstracts sampled from each panel. From a pilot trial, we found that undergraduate classifiers were able to comfortably complete about two to three hundred abstracts in half a day. Hence, all our test sets consist of 247 abstracts, or 13 abstracts from each evaluation panel. In the undergraduates stage, all undergraduates were given the same test set to classify. In subsequent stages, we designed the test sets so that each human classifier would face a unique test set consisting of a common component of 95 abstracts (5 abstracts from each panel) and an individual, independently sampled component of 152 abstracts (8 abstracts from each panel). The common component addresses question (2) above about whether human or ML classifiers are more reliable. The individual, independently sampled component allows us to check that the results are robust to idiosyncrasies in the common component abstracts. This is especially important for addressing question (1) on comparing ML accuracy to that of human classifiers, since any given ML model, once trained, will always produce the same classification output in response to the same test set. A variety of test sets is therefore necessary to provide a more robust estimate of ML accuracy.
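A minimal sketch of this stratified sampling design, assuming the abstracts are held in a pandas DataFrame `abstracts` with a `panel` column (the variable names are illustrative, not our actual code):

```python
import pandas as pd

def stratified_sample(pool: pd.DataFrame, per_panel: int, seed: int) -> pd.DataFrame:
    """Draw the same number of abstracts from each evaluation panel."""
    return pool.groupby("panel", group_keys=False).sample(n=per_panel, random_state=seed)

# Common component shared by all classifiers in a stage (5 per panel = 95 abstracts) ...
common = stratified_sample(abstracts, per_panel=5, seed=0)

# ... plus an individual component sampled from the remaining abstracts (8 per panel = 152).
individual = stratified_sample(abstracts.drop(common.index), per_panel=8, seed=1)

test_set = pd.concat([common, individual])  # 13 per panel, 247 abstracts in total
```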
From the pilot trial we also found that undergraduate classifiers were able to comfortably study several hundred abstracts in half a day. Hence, to address question (3)(i) above about the effect of more training on human classification performance, in the Undergraduates stage we generated two large training sets of 380 abstracts (20 from each panel) and two small training sets of 190 abstracts (10 from each panel), and randomly assigned undergraduate classifiers to each training set. To generate the four training sets, we first created 20 randomly sampled sets of abstracts (10 small and 10 large) and trained an ML classifier on each set. We then scored each trained ML classifier on the Undergraduates test set and, from the small and large candidate sets respectively, chose the training sets that produced the best- and worst-performing ML classifiers, yielding four training sets in total. Table 2 summarizes the number of classifiers and the sizes of the training and test sets given in each stage.
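The selection of the four training sets can be sketched as follows, reusing the `stratified_sample` helper from the sketch above; `candidate_pool`, `undergrad_test_set` and the helper `train_and_score` (which fits the SVM pipeline described below and returns its test-set F1 score) are placeholders rather than our actual code:

```python
def pick_best_and_worst(candidates, test_set):
    """Rank candidate training sets by the test-set F1 of a classifier trained on each."""
    ranked = sorted(candidates, key=lambda train_set: train_and_score(train_set, test_set))
    return ranked[-1], ranked[0]  # (best, worst)

# 10 small (10 abstracts per panel) and 10 large (20 per panel) candidate training sets.
small_candidates = [stratified_sample(candidate_pool, per_panel=10, seed=s) for s in range(10)]
large_candidates = [stratified_sample(candidate_pool, per_panel=20, seed=s) for s in range(10, 20)]

# Keep the best- and worst-performing set of each size: four training sets in total.
small_best, small_worst = pick_best_and_worst(small_candidates, undergrad_test_set)
large_best, large_worst = pick_best_and_worst(large_candidates, undergrad_test_set)
```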
We use the support vector machines (SVM) algorithm as we found in Khor et al. (2018) that it has the best abstract classification performance among the basic supervised classification algorithms. The SVM algorithm finds, for every category, the optimal hyperplane that separates the data into an “In” and an “Out” class using a maximum-margin criterion (see Cortes and Vapnik 1995 for a detailed explanation). We combine SVM with bag-of-words pre-processing of the abstracts and use term frequency-inverse document frequency (TF-IDF) as our feature score (see Baeza-Yates and Ribeiro-Neto 1999 for a detailed discussion of information retrieval). For hyperparameter optimisation, we use grid search with cross-validation due to its ease of implementation. For an extended discussion of hyperparameter optimisation in machine learning, see Bergstra and Bengio (2012).
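As an illustration, such a classifier can be assembled with scikit-learn’s TF-IDF vectoriser, a linear SVM (one-vs-rest across panels) and grid search with cross-validation, assuming the DataFrames from the sketches above with `text` and `panel` columns; the parameter grid shown is a placeholder, not the grid we actually searched:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Bag-of-words features weighted by TF-IDF, fed to a linear SVM (one-vs-rest over the 19 panels).
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])

# Hyperparameter optimisation: grid search with 5-fold cross-validation on the training set.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "svm__C": [0.1, 1, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
search.fit(train_set["text"], train_set["panel"])

predictions = search.predict(test_set["text"])
```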
For a fair comparison of classification performance between human and ML classifiers for question (1) above, the training given to our ML classifiers must match, as closely as is reasonable, the amount of training given to the human classifiers. In the undergraduate stages, we train an ML classifier on each of the four training sets. Undergraduate classifiers from each training set group are then compared only to the performance of the ML classifier trained on the same training set. In the Postgraduates stage, no training sets are given to the postgraduate classifiers, as their doctoral training is taken to be an extensive period of training in the knowledge and disciplinary boundaries of science. To simulate such extensive background training, the ML classifier in the Postgraduates stage is trained, for each test set, on all abstracts left out of that test set (2276 abstracts in total).
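Under the same illustrative setup, the Postgraduates-stage training set for a given test set is simply the complement of that test set:

```python
# Postgraduates stage: train on every abstract not in the given test set (2523 - 247 = 2276).
background_train_set = abstracts.drop(test_set.index)
search.fit(background_train_set["text"], background_train_set["panel"])
```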
While there are many measures of accuracy in classification problems, precision and recall are most often reported (Sokolova and Lapalme 2009). Precision is the ratio of true positives to the sum of true positives and false positives. Recall is the ratio of true positives to the sum of true positives and false negatives. Positive and negative refer to whether an abstract is classified into a given evaluation panel or not. Intuitively, precision is the proportion of abstracts classified into a panel that actually belong to it, while recall is the proportion of abstracts belonging to a panel that are classified into it. Because both are important measures of accuracy, their harmonic mean, the F1 score, is our preferred accuracy metric. As precision, recall and F1 are defined only for a 2 × 2 confusion matrix, the overall precision, recall and F1 of a test set are the means of the per-panel scores across all evaluation panels.
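In symbols, for a given evaluation panel with true positives TP, false positives FP and false negatives FN, and with overall scores averaged across the 19 panels:
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},
\]
\[
F_1^{\text{overall}} = \frac{1}{19} \sum_{p=1}^{19} F_1^{(p)},
\]
and analogously for overall precision and recall.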
The reliability of a group of classifiers is also known as inter-rater reliability (IRR), which measures the extent to which different classifiers classify the same abstracts to the same evaluation panel. Reliability does not measure whether classifications match the ground truth, only whether different classifiers agree on the same classification. We measure reliability with Fleiss’ κ (Fleiss 1971), as our data contain more than two classifiers per test set. κ has an upper limit of 1, which represents perfect agreement, while 0 implies that the agreement rate is no better than pure chance. Negative values of κ imply disagreement beyond what would be expected by chance alone. For the interpretation of κ, Landis and Koch (1977) proposed the following scale: κ > 0.4 is “Moderate” agreement, κ > 0.6 is “Substantial” agreement and κ > 0.8 is “Almost Perfect” agreement. For a detailed discussion of IRR, refer to McHugh (2012).
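Concretely, with N abstracts in the common component, n classifiers rating each abstract, k panels, and n_ij the number of classifiers assigning abstract i to panel j:
\[
P_i = \frac{1}{n(n-1)} \Bigl( \sum_{j=1}^{k} n_{ij}^{2} - n \Bigr), \qquad
p_j = \frac{1}{Nn} \sum_{i=1}^{N} n_{ij},
\]
\[
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}, \quad \text{where} \quad
\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i \quad \text{and} \quad
\bar{P}_e = \sum_{j=1}^{k} p_j^{2}.
\]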
We exclude test sets where the human classifier failed to complete at least 95% of the abstracts in their test set. One set in the undergraduates stage and one set in the high-performance undergraduates stage were excluded on this basis. We also excluded one human classifier who had 89% and 97% accuracy in the two high-performance undergraduates stages. The extremely high performance of this classifier, both relative to their own prior performance and to that of other classifiers, suggested use of the internet (where all ERC abstracts and their evaluation panel assignments are searchable). This classifier’s data is retained in the first undergraduates stage, where no internet access was possible.