Multilingual Epidemic Event Extraction

  • Conference paper
Towards Open and Trustworthy Digital Societies (ICADL 2021)

Abstract

In this paper, we focus on epidemic event extraction in multilingual and low-resource settings. The task of extracting epidemic events is defined as the detection of disease names and locations in a document. We experiment with a multilingual dataset comprising news articles from the medical domain in languages with diverse morphological structures (Chinese, English, French, Greek, Polish, and Russian). We investigate various Transformer-based models, also adopting a two-stage strategy, first finding the documents that contain events and then performing event extraction. Our results show that error propagation to the downstream task was higher than expected. We also perform an in-depth analysis of the results, concluding that different entity characteristics can influence the performance. Moreover, we perform several preliminary experiments for the low-resource languages present in the dataset using the mean teacher semi-supervised technique. Our findings show the potential of pre-trained language models benefiting from the incorporation of unannotated data in the training process.
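
The mean teacher technique named in the abstract can be illustrated with a minimal sketch. In the formulation of Tarvainen and Valpola [41], the teacher's weights are an exponential moving average (EMA) of the student's weights, and a consistency cost penalizes disagreement between student and teacher predictions on unlabeled inputs. The function names, the decay `alpha`, the consistency `weight`, and the toy numbers below are illustrative assumptions, not the configuration used in the paper.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float], alpha: float = 0.99) -> List[float]:
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def consistency_cost(student_probs: List[float], teacher_probs: List[float]) -> float:
    """Mean squared difference between student and teacher predictions."""
    if len(student_probs) != len(teacher_probs):
        raise ValueError("prediction vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(squared) / len(squared)


def total_loss(supervised: float, consistency: float, weight: float = 1.0) -> float:
    """Supervised loss on labeled data plus weighted consistency cost."""
    return supervised + weight * consistency


# Toy values (assumed): two weights and a two-class prediction.
teacher_weights = ema_update([0.0, 1.0], [1.0, 0.0], alpha=0.9)
cost = consistency_cost([0.8, 0.2], [0.6, 0.4])
loss = total_loss(supervised=0.5, consistency=cost, weight=2.0)
```

Only the student is updated by gradient descent on the combined loss; the teacher is refreshed with `ema_update` after each student step, which is what lets unannotated text contribute to training through the consistency term.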

This work has been supported by the European Union’s Horizon 2020 research and innovation program under grants 770299 (NewsEye) and 825153 (Embeddia). It has also been supported by the French Embassy in Kenya and the French Foreign Ministry.

Notes

  1. The DAnIEL dataset is available at https://daniel.greyc.fr/public/index.php?a=corpus.

  2. The token-level annotated dataset is available at https://bit.ly/3kUQcXD.

  3. For this model, we used the parameters recommended in [11].

  4. https://huggingface.co/bert-base-multilingual-cased. This model was pre-trained on the top 104 languages having the largest Wikipedia editions using a masked language modeling (MLM) objective.

  5. https://huggingface.co/bert-base-multilingual-uncased. This model was pre-trained on the top 102 languages having the largest Wikipedia editions using a masked language modeling (MLM) objective.

  6. XLM-RoBERTa-base was trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages.

  7. https://jku-vds-lab.at/tools/upset/.

  8. The code [16] is available here: https://github.com/neulab/InterpretEval.

References

  1. Aiello, A.E., Renson, A., Zivich, P.N.: Social media- and internet-based disease surveillance for public health. Ann. Rev. Public Health 41, 101–118 (2020)

  2. Bernardo, T.M., Rajic, A., Young, I., Robiadek, K., Pham, M.T., Funk, J.A.: Scoping review on search queries and social media for disease surveillance: a chronology of innovation. J. Med. Internet Res. 15(7), e147 (2013)

  3. Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., Choi, Y.: COMET: commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317 (2019)

  4. Brixtel, R., Lejeune, G., Doucet, A., Lucas, N.: Any language early detection of epidemic diseases from web news streams. In: 2013 IEEE International Conference on Healthcare Informatics, pp. 159–168. IEEE (2013)

  5. Casey, A., et al.: Plague dot text: text mining and annotation of outbreak reports of the Third Plague Pandemic (1894–1952). J. Data Min. Digit. Humanit. HistoInf. (2021). https://jdmdh.episciences.org/7105

  6. Chen, S., Pei, Y., Ke, Z., Silamu, W.: Low-resource named entity recognition via the pre-training model. Symmetry 13(5), 786 (2021)

  7. Choi, J., Cho, Y., Shim, E., Woo, H.: Web-based infectious disease surveillance systems and public health perspectives: a systematic review. BMC Public Health 16(1), 1–10 (2016)

  8. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, 5–10 July 2020, pp. 8440–8451. Association for Computational Linguistics (2020). https://www.aclweb.org/anthology/2020.acl-main.747/

  9. Conneau, A., Lample, G.: Cross-lingual language model pretraining. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32, pp. 7059–7069. Curran Associates, Inc. (2019). http://papers.nips.cc/paper/8928-cross-lingual-language-model-pretraining.pdf

  10. Dean, K., Krauer, F., Schmid, B.: Epidemiology of a bubonic plague outbreak in Glasgow, Scotland in 1900. R. Soc. Open Sci. 6, 181695 (2019). https://doi.org/10.1098/rsos.181695

  11. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/N19-1423

  12. Ding, B., et al.: DAGA: data augmentation with a generation approach for low-resource tagging tasks. arXiv preprint arXiv:2011.01549 (2020)

  13. Doan, S., Ngo, Q.H., Kawazoe, A., Collier, N.: Global health monitor-a web-based system for detecting and mapping infectious diseases. In: Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II (2008)

  14. Dórea, F.C., Revie, C.W.: Data-driven surveillance: effective collection, integration and interpretation of data to support decision-making. Front. Vet. Sci. 8, 225 (2021)

  15. Feng, X., Feng, X., Qin, B., Feng, Z., Liu, T.: Improving low resource named entity recognition using cross-lingual knowledge transfer. In: IJCAI, pp. 4071–4077 (2018)

  16. Fu, J., Liu, P., Neubig, G.: Interpretable multi-dataset evaluation for named entity recognition. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6058–6069 (2020)

  17. Fu, J., Liu, P., Zhang, Q.: Rethinking generalization of neural models: a named entity recognition case study. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 7732–7739 (2020)

  18. Glaser, I., Sadegharmaki, S., Komboz, B., Matthes, F.: Data scarcity: methods to improve the quality of text classification. In: ICPRAM, pp. 556–564 (2021)

  19. Grancharova, M., Berg, H., Dalianis, H.: Improving named entity recognition and classification in class imbalanced Swedish electronic patient records through resampling. In: Eighth Swedish Language Technology Conference (SLTC). Förlag Göteborgs Universitet (2020)

  20. Gururangan, S., et al.: Don’t stop pretraining: adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964 (2020)

  21. Hamborg, F., Lachnit, S., Schubotz, M., Hepp, T., Gipp, B.: Giveme5W: main event retrieval from news articles by extraction of the five journalistic W questions. In: Chowdhury, G., McLeod, J., Gillet, V., Willett, P. (eds.) iConference 2018. LNCS, vol. 10766, pp. 356–366. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78105-1_39

  22. Joshi, A., Karimi, S., Sparks, R., Paris, C., Macintyre, C.R.: Survey of text-based epidemic intelligence: a computational linguistics perspective. ACM Comput. Surv. (CSUR) 52(6), 1–19 (2019)

  23. Joshi, M., Chen, D., Liu, Y., Weld, D.S., Zettlemoyer, L., Levy, O.: SpanBERT: improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguist. 8, 64–77 (2020)

  24. Kozareva, Z., Bonev, B., Montoyo, A.: Self-training and co-training applied to Spanish named entity recognition. In: Gelbukh, A., de Albornoz, Á., Terashima-Marín, H. (eds.) MICAI 2005. LNCS (LNAI), vol. 3789, pp. 770–779. Springer, Heidelberg (2005). https://doi.org/10.1007/11579427_78

  25. Lampos, V., Zou, B., Cox, I.J.: Enhancing feature selection using word embeddings: the case of flu surveillance. In: Proceedings of the 26th International Conference on World Wide Web, pp. 695–704 (2017)

  26. Lejeune, G., Brixtel, R., Doucet, A., Lucas, N.: Multilingual event extraction for epidemic detection. Artif. Intell. Med. 65 (2015). https://doi.org/10.1016/j.artmed.2015.06.005

  27. Lejeune, G., Brixtel, R., Lecluze, C., Doucet, A., Lucas, N.: Added-value of automatic multilingual text analysis for epidemic surveillance. In: Peek, N., Marín Morales, R., Peleg, M. (eds.) AIME 2013. LNCS (LNAI), vol. 7885, pp. 284–294. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38326-7_40

  28. Lejeune, G., Doucet, A., Yangarber, R., Lucas, N.: Filtering news for epidemic surveillance: towards processing more languages with fewer resources. In: Proceedings of the 4th Workshop on Cross Lingual Information Access, pp. 3–10 (2010)

  29. Makhoul, J., Kubala, F., Schwartz, R., Weischedel, R., et al.: Performance measures for information extraction. In: Proceedings of DARPA Broadcast News Workshop, Herndon, VA, pp. 249–252 (1999)

  30. Mutuvi, S., Boros, E., Doucet, A., Lejeune, G., Jatowt, A., Odeo, M.: Multilingual epidemiological text classification: a comparative study. In: COLING, International Conference on Computational Linguistics (2020)

  31. Mutuvi, S., Boros, E., Doucet, A., Lejeune, G., Jatowt, A., Odeo, M.: Token-level multilingual epidemic dataset for event extraction. In: Berget, G., Hall, M.M., Brenn, D., Kumpulainen, S. (eds.) TPDL 2021. LNCS, vol. 12866, pp. 55–59. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86324-1_6

  32. Mutuvi, S., Doucet, A., Odeo, M., Jatowt, A.: Evaluating the impact of OCR errors on topic modeling. In: Dobreva, M., Hinze, A., Žumer, M. (eds.) ICADL 2018. LNCS, vol. 11279, pp. 3–14. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04257-8_1

  33. Neubig, G., et al.: compare-MT: a tool for holistic comparison of language generation systems. arXiv preprint arXiv:1903.07926 (2019)

  34. Neudecker, C., Antonacopoulos, A.: Making Europe’s historical newspapers searchable. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 405–410. IEEE (2016)

  35. Ng, V., Rees, E.E., Niu, J., Zaghool, A., Ghiasbeglou, H., Verster, A.: Application of natural language processing algorithms for extracting information from news articles in event-based surveillance. Can. Commun. Dis. Rep. 46(6), 186–191 (2020)

  36. Nguyen, N.K., Boros, E., Lejeune, G., Doucet, A.: Impact analysis of document digitization on event extraction. In: 4th Workshop on Natural Language for Artificial Intelligence (NL4AI 2020) co-located with the 19th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2020), vol. 2735, pp. 17–28 (2020)

  37. Pan, X., Zhang, B., May, J., Nothman, J., Knight, K., Ji, H.: Cross-lingual name tagging and linking for 282 languages. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1946–1958 (2017)

  38. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training. Arxiv (2018)

  39. Riedl, M., Padó, S.: A named entity recognition shootout for German. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 120–125 (2018)

  40. Salathé, M., Freifeld, C.C., Mekaru, S.R., Tomasulo, A.F., Brownstein, J.S.: Influenza A (H7N9) and the importance of digital epidemiology. N. Engl. J. Med. 369(5), 401 (2013)

  41. Tarvainen, A., Valpola, H.: Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780 (2017)

  42. Van Asch, V., Daelemans, W.: Predicting the effectiveness of self-training: application to sentiment classification. arXiv preprint arXiv:1601.03288 (2016)

  43. van Engelen, J.E., Hoos, H.H.: A survey on semi-supervised learning. Mach. Learn. 109(2), 373–440 (2019). https://doi.org/10.1007/s10994-019-05855-6

  44. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

  45. Walker, D., Lund, W.B., Ringger, E.: Evaluating models of latent document semantics in the presence of OCR errors. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 240–250 (2010)

  46. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018)

  47. Wang, C.K., Singh, O., Tang, Z.L., Dai, H.J.: Using a recurrent neural network model for classification of tweets conveyed influenza-related information. In: Proceedings of the International Workshop on Digital Disease Detection Using Social Media 2017 (DDDSM-2017), pp. 33–38 (2017)

  48. Wang, W., Huang, Z., Harper, M.: Semi-supervised learning for part-of-speech tagging of Mandarin transcribed speech. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP 2007, vol. 4, pp. IV-137. IEEE (2007)

  49. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189–196 (1995)

  50. Zhu, X.J.: Semi-supervised learning literature survey (2005)

Corresponding author

Correspondence to Stephen Mutuvi.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Mutuvi, S., Boros, E., Doucet, A., Lejeune, G., Jatowt, A., Odeo, M. (2021). Multilingual Epidemic Event Extraction. In: Ke, HR., Lee, C.S., Sugiyama, K. (eds) Towards Open and Trustworthy Digital Societies. ICADL 2021. Lecture Notes in Computer Science, vol 13133. Springer, Cham. https://doi.org/10.1007/978-3-030-91669-5_12

  • DOI: https://doi.org/10.1007/978-3-030-91669-5_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91668-8

  • Online ISBN: 978-3-030-91669-5

  • eBook Packages: Computer Science, Computer Science (R0)
