
Divide to Better Classify

Conference paper · Artificial Intelligence in Medicine (AIME 2020)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12299)

Abstract

Medical information is present in various text-based resources such as electronic medical records, the biomedical literature, and social media. Extracting useful information from all of these sources is a real challenge. In this context, the single-label classification of texts is an important task. Deep learning classifiers have recently shown their ability to achieve very good results; however, their performance generally depends on the amount of data available during the training phase. In this article, we propose a new approach for augmenting text data. We compared it against the main approaches in the literature on five real datasets, and our proposal outperforms them in all configurations.
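The method itself is not detailed in this preview. As a rough illustration only, here is a minimal sketch assuming the "divide" idea amounts to splitting each labeled document into smaller segments that inherit the document's label, thereby multiplying the number of training examples. The function names and the fixed-size word chunking are hypothetical, not the authors' exact procedure.

# A minimal sketch of segment-based text augmentation: each labeled
# document is divided into smaller pieces, and every piece keeps the
# label of its source document. Illustrative only.
from typing import List, Tuple


def divide_document(text: str, n_chunks: int = 2) -> List[str]:
    """Split a document into roughly equal word spans."""
    words = text.split()
    size = max(1, len(words) // n_chunks)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def augment(corpus: List[Tuple[str, str]], n_chunks: int = 2) -> List[Tuple[str, str]]:
    """Expand the corpus: every chunk inherits its document's label."""
    augmented = []
    for text, label in corpus:
        for chunk in divide_document(text, n_chunks):
            augmented.append((chunk, label))
    return augmented


corpus = [("The patient reported headaches. Treatment reduced symptoms.", "outcome")]
print(augment(corpus))  # several labeled fragments instead of one example

The appeal of such a splitting scheme is that, unlike synonym replacement or back-translation, it introduces no artificial vocabulary: every augmented example is verbatim text from the original corpus.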


Notes

  1. https://github.com/ym001/Manteia/blob/master/notebook/notebook_Manteia_classification_augmentation_run_in_colab.ipynb

  2. https://github.com/Franck-Dernoncourt/pubmed-rct

  3. https://pages.semanticscholar.org/coronavirus-research

  4. https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29

  5. https://early.irlab.org/2018/index.html

  6. https://www.reddit.com/

  7. https://www.nltk.org/howto/wordnet.html

  8. https://yandex.com/

  9. https://github.com/ym001/DAIA/blob/master/Preliminary%20experiment.pdf
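Footnotes 7 and 8 point to tools commonly used by baseline text-augmentation techniques: NLTK's WordNet interface for synonym replacement, and a machine-translation service for back-translation. Below is a minimal sketch of the synonym-replacement idea using NLTK's WordNet; the replacement policy (first lemma of the first synset) is a deliberate simplification and not necessarily the configuration used in the paper's experiments.

# A minimal sketch of WordNet-based synonym replacement (see footnote 7).
import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)


def synonym_replace(sentence: str, n: int = 2, seed: int = 0) -> str:
    """Replace up to n words that have WordNet synsets with a synonym."""
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    for i in rng.sample(candidates, min(n, len(candidates))):
        lemmas = wordnet.synsets(words[i])[0].lemma_names()
        synonyms = [l for l in lemmas if l.lower() != words[i].lower()]
        if synonyms:
            words[i] = synonyms[0].replace("_", " ")
    return " ".join(words)


print(synonym_replace("the drug reduced the patient pain"))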


Author information

Corresponding author: Yves Mercadier


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Mercadier, Y., Azé, J., Bringay, S. (2020). Divide to Better Classify. In: Michalowski, M., Moskovitch, R. (eds) Artificial Intelligence in Medicine. AIME 2020. Lecture Notes in Computer Science (LNAI), vol 12299. Springer, Cham. https://doi.org/10.1007/978-3-030-59137-3_9

  • DOI: https://doi.org/10.1007/978-3-030-59137-3_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59136-6

  • Online ISBN: 978-3-030-59137-3

  • eBook Packages: Computer Science, Computer Science (R0)
