Abstract
Medical Subject Headings (MeSH) is a controlled thesaurus developed by the National Library of Medicine (NLM). MeSH covers a wide variety of biomedical topics, such as diseases and drugs, and its terms are used to classify PubMed articles. Human indexers at NLM have been annotating PubMed articles with MeSH terms for decades, producing millions of MeSH-labeled articles. Recently, many deep learning algorithms have been developed to annotate MeSH terms automatically using this large-scale MeSH indexing dataset. However, most of these models are trained on all articles indiscriminately, ignoring the temporal structure of the dataset. In this paper, we uncover and thoroughly characterize the problem of MeSH indexing dataset shift (MeSHIFT), meaning that the data distribution changes with time. MeSHIFT includes shifts in the input articles, the output MeSH labels, and the annotation rules. We find that machine learning models lose performance when MeSHIFT is not addressed. To this end, we present a novel method, time-aware concept embedding learning (TaCEL), as an attempt to solve it. TaCEL is a plug-in module that can be easily incorporated into other automatic MeSH indexing models. Results show that TaCEL improves current state-of-the-art models at only minimal additional cost. We hope this work facilitates understanding of the MeSH indexing dataset, especially its temporal structure, and provides a solution that can be used to improve current models.
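As a rough illustration of what a shift in the output MeSH labels could look like, the following sketch compares per-year label frequency distributions on toy data. The data, the function names, and the choice of total variation distance are all assumptions made for illustration; this is not the paper's analysis and it is not TaCEL.

```python
from collections import Counter


def label_distribution(records):
    """Relative frequency of each label across a list of per-article label sets."""
    counts = Counter(label for labels in records for label in labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means identical distributions, 1.0 means disjoint support.
    """
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


# Hypothetical toy data (not from the MeSH indexing dataset):
# year -> list of label sets, one set per article.
articles_by_year = {
    2000: [{"Humans", "Diabetes Mellitus"}, {"Humans", "Insulin"}],
    2010: [{"Humans", "Diabetes Mellitus"}, {"Humans", "Genomics"}],
    2020: [{"Humans", "Genomics"}, {"Deep Learning", "Genomics"}],
}

# Compare every year against the most recent year.
reference = label_distribution(articles_by_year[2020])
shift = {
    year: total_variation(label_distribution(records), reference)
    for year, records in articles_by_year.items()
}

for year in sorted(shift):
    print(year, round(shift[year], 3))
```

In this toy setup, older years sit farther from the reference year, which is the kind of temporal structure a model trained on all years indiscriminately would ignore.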
Notes
- 1.
- 2. In this paper, we use “abstracts” and “articles” interchangeably.
- 3. The year is the smallest unit of time in the MeSH indexing dataset.
- 4.
- 5.
- 6. We did not compare with the BioASQ challenge results for two reasons: (1) the labels of the challenge test sets are not publicly available; (2) the submitted results are generated by model ensembles. In the experiments, we use the challenge winner system, MeSHProbeNet [24], as a strong baseline (i.e., the Backbone Model).
References
Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium, p. 17. American Medical Informatics Association (2001)
Brzeziński, D., Stefanowski, J.: Accuracy updated ensemble for data streams with concept drift. In: Corchado, E., Kurzyński, M., Woźniak, M. (eds.) HAIS 2011. LNCS (LNAI), vol. 6679, pp. 155–163. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21222-2_19
Costa, J., Silva, C., Antunes, M., Ribeiro, B.: Concept drift awareness in Twitter streams. In: 2014 13th International Conference on Machine Learning and Applications, pp. 294–299. IEEE (2014)
Delany, S.J., Cunningham, P., Tsymbal, A., Coyle, L.: A case-based technique for tracking concept drift in spam filtering. In: Macintosh, A., Ellis, R., Allen, T. (eds.) SGAI 2004, pp. 3–16. Springer, London (2004). https://doi.org/10.1007/1-84628-103-2_1
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Jin, Q., Dhingra, B., Cohen, W., Lu, X.: AttentionMeSH: simple, effective and interpretable automatic MeSH indexer. In: Proceedings of the 6th BioASQ Workshop A Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, pp. 47–56 (2018)
Karaa, A., Goldstein, A.: The spectrum of clinical presentation, diagnosis, and management of mitochondrial forms of diabetes. Pediatr. Diab. 16(1), 1–9 (2015)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Koren, Y.: Collaborative filtering with temporal dynamics. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 447–456. ACM (2009)
Krawczyk, B., Minku, L.L., Gama, J., Stefanowski, J., Woźniak, M.: Ensemble learning for data stream analysis: a survey. Inf. Fusion 37, 132–156 (2017)
Pyysalo, S., Ginter, F., Moen, H., Salakoski, T., Ananiadou, S.: Distributional semantics resources for biomedical text processing. In: Proceedings of LBM (2013)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 807–814 (2010)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8024–8035 (2019)
Peng, S., You, R., Wang, H., Zhai, C., Mamitsuka, H., Zhu, S.: DeepMeSH: deep semantic representation for improving large-scale MeSH indexing. Bioinformatics 32(12), i70–i79 (2016)
Peters, M.E., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. The MIT Press, Cambridge (2009)
Street, W.N., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale classification. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 377–382. ACM (2001)
Sun, J., Li, H.: Dynamic financial distress prediction using instance selection for the disposal of concept drift. Expert Syst. Appl. 38(3), 2566–2576 (2011)
Tsatsaronis, G., et al.: An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinform. 16(1), 138 (2015)
Tsymbal, A.: The problem of concept drift: definitions and related work. Comput. Sci. Dep. Trinity College Dublin 106(2), 58 (2004)
Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226–235. ACM (2003)
Woźniak, M., Graña, M., Corchado, E.: A survey of multiple classifier systems as hybrid systems. Inf. Fusion 16, 3–17 (2014)
Xun, G., Jha, K., Yuan, Y., Wang, Y., Zhang, A.: MeSHProbeNet: a self-attentive probe net for MeSH indexing. Bioinformatics 35, 3794–3802 (2019)
Yao, Y., Rosasco, L., Caponnetto, A.: On early stopping in gradient descent learning. Constr. Approximation 26(2), 289–315 (2007)
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Jin, Q., Ding, H., Li, L., Huang, H., Wang, L., Yan, J. (2020). Tackling MeSH Indexing Dataset Shift with Time-Aware Concept Embedding Learning. In: Nah, Y., Cui, B., Lee, SW., Yu, J.X., Moon, YS., Whang, S.E. (eds) Database Systems for Advanced Applications. DASFAA 2020. Lecture Notes in Computer Science, vol 12114. Springer, Cham. https://doi.org/10.1007/978-3-030-59419-0_29
DOI: https://doi.org/10.1007/978-3-030-59419-0_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59418-3
Online ISBN: 978-3-030-59419-0
eBook Packages: Computer Science, Computer Science (R0)