Deep learning in automated text classification: a case study using toxicological abstracts

Environment Systems and Decisions

Abstract

Machine learning technology has been widely adopted as a cost-saving document prioritization approach in systematic literature reviews related to human health risk assessments. Supervised approaches use a training dataset, a relatively small set of documents with human-annotated labels indicating the topic of each document, to build models that automatically predict the labels of a much larger set of unlabelled documents. Deep learning algorithms form a branch of machine learning that relies on complex neural network architectures to learn the features of the object to be classified. Although deep learning algorithms have until recently been applied mainly to image, video, and audio classification, they are increasingly being applied to text classification problems. To explore the potential advantages and practicalities of using deep learning algorithms in the document prioritization step of systematic literature reviews, we compare the performance of the most commonly used deep learning architectures with that of more traditional machine learning models, using a dataset of approximately 7000 abstracts from the scientific literature related to the chemical arsenic. The dataset was previously annotated by subject matter experts with regard to relevance to toxicological mode of action. We examine the relative performance of each algorithm type at alternative levels of training by sequentially expanding the training dataset to generate a learning curve. We find that deep learning offers increased performance in some instances but also requires more training data, longer model training time, more computational power, and more labor-intensive algorithm tuning than baseline traditional machine learning algorithms.
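The learning-curve procedure described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the classifier is a small hand-written naive Bayes model standing in for a "traditional" baseline, the metric is plain accuracy, and the six training sentences and two test sentences are invented toy data, not the arsenic abstracts. The names `NaiveBayes` and `learning_curve` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (stand-in baseline)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, doc):
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)
            for word in tokenize(doc):
                if word in self.vocab:  # ignore words never seen in training
                    count = self.word_counts[c][word] + 1
                    score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_class, best_score = c, score
        return best_class


def learning_curve(train, test, sizes):
    """Train on the first n labelled documents for each n; return (n, accuracy)."""
    curve = []
    for n in sizes:
        docs, labels = zip(*train[:n])
        model = NaiveBayes().fit(docs, labels)
        correct = sum(model.predict(text) == label for text, label in test)
        curve.append((n, correct / len(test)))
    return curve


# Invented toy data: label 1 = relevant to mode of action, 0 = not relevant.
train = [
    ("arsenic induces oxidative stress in liver cells", 1),
    ("survey of arsenic levels in groundwater wells", 0),
    ("arsenic exposure alters dna methylation in cells", 1),
    ("arsenic removal from drinking water by filtration", 0),
    ("oxidative stress and dna damage after arsenic exposure", 1),
    ("groundwater monitoring and filtration methods", 0),
]
test = [
    ("dna damage in cells exposed to arsenic", 1),
    ("filtration of groundwater wells", 0),
]

curve = learning_curve(train, test, sizes=[2, 4, 6])
for n, accuracy in curve:
    print(f"training size {n}: accuracy {accuracy:.2f}")
```

Running the same loop once per model type, with the same training sizes and the same held-out test set, gives curves that can be compared directly, which is the kind of comparison the abstract describes between deep learning and traditional models.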


Notes

  1. https://hero.epa.gov/hero/index.cfm/content/home.

  2. https://hero.epa.gov/hero/index.cfm/project/page/project_id/2489.

  3. https://www.ncbi.nlm.nih.gov/pubmed/.

  4. https://www.icf-docter.com/.


Acknowledgements

The development of the methods presented here was fully supported by ICF. The results presented here were generated for the purposes of this paper alone.

Author information

Corresponding author

Correspondence to Arun Varghese.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 56 kb)

About this article

Cite this article

Varghese, A., Agyeman-Badu, G. & Cawley, M. Deep learning in automated text classification: a case study using toxicological abstracts. Environ Syst Decis 40, 465–479 (2020). https://doi.org/10.1007/s10669-020-09763-2

  • DOI: https://doi.org/10.1007/s10669-020-09763-2
