Abstract
Medical information is present in various text-based resources such as electronic medical records, biomedical literature, and social media. Exploiting all of these sources to extract useful information is a real challenge. In this context, single-label text classification is an important task. Recently, deep classifiers have shown their ability to achieve very good results; however, their performance generally depends on the amount of data available during the training phase. In this article, we propose a new approach to augment text data. We compared this approach against the main approaches in the literature on five real data sets, and our proposal outperforms them in all configurations.
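As a rough illustration of the kind of augmentation the abstract describes, the sketch below splits each labeled document into sentence-level fragments that inherit the document's label, which is one plausible reading of the title's "divide" idea. The function names, the sentence-splitting heuristic, and the filtering threshold are assumptions made for illustration, not the authors' actual method.

```python
import re

def split_into_sentences(text):
    # Naive splitter on sentence-final punctuation; a real pipeline
    # would use a proper tokenizer (e.g. NLTK or spaCy).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def divide_augment(texts, labels, min_words=3):
    """Return the original examples plus one new example per sentence,
    each sentence inheriting its source document's label."""
    aug_texts, aug_labels = list(texts), list(labels)
    for text, label in zip(texts, labels):
        for sent in split_into_sentences(text):
            if len(sent.split()) >= min_words:  # skip tiny fragments
                aug_texts.append(sent)
                aug_labels.append(label)
    return aug_texts, aug_labels

# Example: one two-sentence clinical note becomes three training examples.
texts = ["The patient reported mild headaches. Symptoms resolved after treatment."]
labels = ["resolved"]
X, y = divide_augment(texts, labels)
print(len(X), len(y))  # 3 3
```

The appeal of label-preserving splitting over, say, back-translation is that it needs no external model and keeps the augmented text in-domain; the trade-off is that a fragment may not carry enough context to justify the inherited label.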
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Mercadier, Y., Azé, J., Bringay, S. (2020). Divide to Better Classify. In: Michalowski, M., Moskovitch, R. (eds.) Artificial Intelligence in Medicine. AIME 2020. Lecture Notes in Computer Science, vol. 12299. Springer, Cham. https://doi.org/10.1007/978-3-030-59137-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59136-6
Online ISBN: 978-3-030-59137-3