
Active learning in automated text classification: a case study exploring bias in predicted model performance metrics

Published in Environment Systems and Decisions

Abstract

Machine learning has emerged as a cost-effective innovation to support systematic literature reviews in human health risk assessments and other contexts. Supervised machine learning approaches rely on a training dataset, a relatively small set of documents with human-annotated labels indicating their topic, to build models that automatically classify a larger set of unclassified documents. “Active” machine learning has been proposed as an approach that limits the cost of creating a training dataset by interactively and sequentially focusing training on only the most informative documents. We simulate active learning using a dataset of approximately 7000 abstracts from the scientific literature related to the chemical arsenic. The dataset was previously annotated by subject matter experts for relevance to two topics relating to toxicology and risk assessment. We examine the performance of alternative sampling approaches to sequentially expanding the training dataset, specifically uncertainty-based sampling and probability-based sampling. We find that while such active learning methods can potentially reduce training dataset size compared to random sampling, predictions of model performance in active learning are likely to suffer from statistical bias that negates the method’s potential benefits. We discuss approaches to, and the extent of, compensating for the bias resulting from skewed sampling. We propose a useful role for active learning in contexts in which the accuracy of model performance metrics is not critical and/or where it is beneficial to rapidly create a class-balanced training dataset.
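The uncertainty-based sampling strategy described in the abstract can be sketched as a simple loop: fit a classifier on the current labeled set, score the unlabeled pool, and ask annotators to label the documents the model is least certain about. The following is a minimal illustration using scikit-learn with synthetic data, not the authors' actual pipeline; the dataset, feature dimensions, seed size, and batch size are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for document feature vectors and binary relevance
# labels (the study uses ~7000 expert-annotated arsenic-related abstracts).
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

# Seed training set of 50 randomly chosen "annotated" documents.
labeled = list(rng.choice(len(X), size=50, replace=False))
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(10):  # 10 active-learning rounds
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])[:, 1]
    # Uncertainty sampling: select the pool documents whose predicted
    # probability of relevance is closest to 0.5 (least model certainty).
    order = np.argsort(np.abs(probs - 0.5))
    picked = [pool[i] for i in order[:25]]  # batch of 25 per round
    labeled.extend(picked)                  # "annotate" them (labels known here)
    pool = [i for i in pool if i not in picked]

print(len(labeled))  # 50 seed + 10 rounds * 25 = 300 labeled documents
```

Note that because the labeled set is deliberately skewed toward hard-to-classify documents, performance metrics estimated from it are not representative of the full corpus; this is the statistical bias the paper examines.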




Acknowledgements

The development of the methods presented here was fully supported by ICF. The results presented here were generated for the purposes of this paper alone. We thank Gregory Carter for review and helpful comments.

Author information


Correspondence to Arun Varghese.

Electronic supplementary material


10669_2019_9717_MOESM1_ESM.docx

The supplementary data include 18 tables that correspond to the results generated in the simulations summarized as trends in Figs. 2–5. In the interests of brevity, these tables present simulation results only up to the point where the actual omission fraction of relevant documents is less than the required threshold of 0.05. Each table is supplied with a proposed interpretation of apparent trends in the context of the theoretical discussions in Section 2. Supplementary material 1 (DOCX 68 KB)
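The "omission fraction" stopping criterion mentioned above can be illustrated with a small helper; here it is assumed to mean the fraction of truly relevant documents that the screening process misses (false negatives over total relevant), which may differ in detail from the paper's exact formulation.

```python
def omission_fraction(true_labels, predicted_labels):
    """Fraction of truly relevant documents (label 1) predicted irrelevant.

    Assumed definition: false negatives / total relevant documents.
    """
    relevant_preds = [p for t, p in zip(true_labels, predicted_labels) if t == 1]
    if not relevant_preds:
        return 0.0
    missed = sum(1 for p in relevant_preds if p == 0)
    return missed / len(relevant_preds)

# 1 of the 4 relevant documents is missed -> 0.25, above the 0.05 threshold
print(omission_fraction([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 1, 1]))
```

Under this definition, screening would continue until the fraction drops below the 0.05 threshold used in the simulations.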


Cite this article

Varghese, A., Hong, T., Hunter, C. et al. Active learning in automated text classification: a case study exploring bias in predicted model performance metrics. Environ Syst Decis 39, 269–280 (2019). https://doi.org/10.1007/s10669-019-09717-3
