Abstract
Machine learning has emerged as a cost-effective innovation to support systematic literature reviews in human health risk assessments and other contexts. Supervised machine learning approaches rely on a training dataset, a relatively small set of documents with human-annotated labels indicating their topic, to build models that automatically classify a larger set of unclassified documents. "Active" machine learning has been proposed as an approach that limits the cost of creating a training dataset by interactively and sequentially focusing training on only the most informative documents. We simulate active learning using a dataset of approximately 7000 abstracts from the scientific literature related to the chemical arsenic. The dataset was previously annotated by subject matter experts with regard to relevance to two topics relating to toxicology and risk assessment. We examine the performance of alternative sampling approaches to sequentially expanding the training dataset, specifically uncertainty-based sampling and probability-based sampling. We find that while such active learning methods can potentially reduce training dataset size compared to random sampling, predictions of model performance in active learning are likely to suffer from statistical bias that negates the method's potential benefits. We discuss approaches to, and the extent of, compensating for the bias that results from skewed sampling. We propose a useful role for active learning in contexts in which the accuracy of model performance metrics is not critical and/or where it is beneficial to rapidly create a class-balanced training dataset.
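To make the uncertainty-based sampling idea concrete, the following is a minimal sketch of a pool-based active learning loop, not the authors' actual pipeline: synthetic data stands in for the labeled abstracts, a scikit-learn logistic regression stands in for the classifier, and each round queries the pool document whose predicted relevance probability is closest to 0.5 (most uncertain).

```python
# Minimal pool-based active learning sketch with uncertainty sampling.
# Data, model choice, and round counts are illustrative assumptions only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Seed training set: a small random sample stands in for the initial
# human-annotated documents; the rest form the unlabeled pool.
labeled = list(rng.choice(len(X), size=20, replace=False))
pool = [i for i in range(len(X)) if i not in labeled]

clf = LogisticRegression(max_iter=1000)
for _ in range(10):
    # Fit on the current training set, then score the unlabeled pool.
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])[:, 1]
    # Uncertainty sampling: query the document with P(relevant) nearest 0.5.
    query = pool[int(np.argmin(np.abs(proba - 0.5)))]
    labeled.append(query)  # in practice, a human annotator labels this one
    pool.remove(query)

accuracy = clf.score(X, y)
```

Note that `accuracy` here is computed on the full dataset; as the abstract argues, performance estimated from an actively sampled training set is skewed toward hard-to-classify documents and is therefore a biased predictor of performance on the full corpus.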
Acknowledgements
The development of the methods presented here was fully supported by ICF. The results presented here were generated for the purposes of this paper alone. We thank Gregory Carter for review and helpful comments.
Electronic supplementary material
Below is the link to the electronic supplementary material.
10669_2019_9717_MOESM1_ESM.docx
The supplementary data include 18 tables that correspond to the results generated in the simulations summarized as trends in Figs. 2–5. In the interests of brevity, these tables present simulation results only up to the point where the actual omission fraction of relevant documents is less than the required threshold of 0.05. Each table is supplied with a proposed interpretation of apparent trends in the context of the theoretical discussions in Section 2. Supplementary material 1 (DOCX 68 KB)
Cite this article
Varghese, A., Hong, T., Hunter, C. et al. Active learning in automated text classification: a case study exploring bias in predicted model performance metrics. Environ Syst Decis 39, 269–280 (2019). https://doi.org/10.1007/s10669-019-09717-3