Addressing Extreme Imbalance for Detecting Medications Mentioned in Twitter User Timelines

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12721)


Tweets mentioning medications are valuable for efforts in digital epidemiology to supplement traditional methods of monitoring public health. A major obstacle, however, is differentiating them from the large majority of tweets on other topics posted in a user’s timeline: the infamous ‘needle in a haystack’ problem. While deep learning models have significantly improved classification, their performance remains low and their inference slow on extremely imbalanced corpora where the tweets of interest are less than 1% of all tweets. In this study, we empirically evaluate under-sampling, fine-tuning, and filtering heuristics to train such classifiers. Using a corpus of 212 Twitter timelines (181,607 tweets, of which only 0.2% mention a medication), our results show that combining these heuristics is necessary to impact the classifier’s performance. In our intrinsic evaluation, a classifier based on a lexicon and a BERT-base neural network achieved a 0.838 F1-score, similar to that of the best classifier on this task during the #SMM4H’20 competition, but it processed the corpus 28 times faster. This is a positive result, since processing speed is still a roadblock to deploying classifiers on the large cohorts of Twitter users needed for pharmacovigilance. In our extrinsic evaluation, our classifier helped a labeler to extract the spans of medications more accurately and achieved a 0.76 Strict F1-score. To the best of our knowledge, this is the first evaluation of medication extraction from Twitter timelines, and it establishes the first benchmark for future studies.
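
As a rough illustration of two of the heuristics named above (lexicon-based filtering and under-sampling) and of the F1-score used for evaluation, the following Python sketch may help. It is not the authors' implementation: the lexicon entries, the function names (`lexicon_filter`, `under_sample`, `f1_score`), and the 1:1 sampling ratio are all assumptions made for illustration, and the BERT classifier itself is omitted.

```python
import random
import re

# Hypothetical mini-lexicon; the lexicon used in the paper is not listed here.
LEXICON = {"ibuprofen", "tylenol", "xanax"}


def lexicon_filter(tweet, lexicon=LEXICON):
    """Return True if any lowercase token of the tweet is in the lexicon.

    Tweets that fail this cheap check could be skipped before running a
    slower neural classifier (an assumed use of the filter).
    """
    tokens = re.findall(r"[a-z0-9]+", tweet.lower())
    return any(tok in lexicon for tok in tokens)


def under_sample(examples, ratio=1.0, seed=0):
    """Keep all positives and at most ratio * n_positives random negatives.

    `examples` is a list of (text, label) pairs with label 1 for tweets
    mentioning a medication and 0 otherwise.
    """
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    rng = random.Random(seed)
    keep = min(len(neg), int(ratio * len(pos)))
    return pos + rng.sample(neg, keep)


def f1_score(gold, pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, `lexicon_filter("Took some Tylenol for my headache")` is true while `lexicon_filter("Great game last night")` is false, so only the first tweet would be passed on to a downstream classifier in this sketch.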


Social media · Medication detection · Text classification



This work was supported by National Library of Medicine grant number R01LM011176 to GG-H. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Library of Medicine.



Copyright information

© Springer Nature Switzerland AG 2021

Authors and Affiliations

  1. Department of Biostatistics, Epidemiology and Informatics (DBEI), University of Pennsylvania, Philadelphia, USA
  2. School of Computing, Informatics, and Decision Systems Engineering (CIDSE), Arizona State University, Tempe, USA
