Tackling MeSH Indexing Dataset Shift with Time-Aware Concept Embedding Learning

Jin, Qiao; Ding, Haoyang; Li, Linfeng; Huang, Haitao; Wang, Lei; Yan, Jun

doi:10.1007/978-3-030-59419-0_29

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12114)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Medical Subject Headings (MeSH) is a controlled thesaurus developed by the National Library of Medicine (NLM). MeSH covers a wide variety of biomedical topics, such as diseases and drugs, and its terms are used to classify PubMed articles. Human indexers at NLM have been annotating PubMed articles with MeSH terms for decades, producing millions of MeSH-labeled articles. Recently, many deep learning algorithms have been developed to annotate MeSH terms automatically using this large-scale MeSH indexing dataset. However, most of these models are trained on all articles indiscriminately, ignoring the temporal structure of the dataset. In this paper, we uncover and thoroughly characterize the problem of MeSH indexing dataset shift (MeSHIFT): the data distribution changes over time. MeSHIFT includes shifts in the input articles, the output MeSH labels, and the annotation rules. We find that machine learning models lose performance when they do not account for MeSHIFT. To address this, we present a novel method, time-aware concept embedding learning (TaCEL). TaCEL is a plug-in module that can be easily incorporated into other automatic MeSH indexing models. Results show that TaCEL improves current state-of-the-art models at only minimal additional cost. We hope this work facilitates understanding of the MeSH indexing dataset, especially its temporal structure, and provides a solution that can be used to improve current models.
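
As a rough illustration of what a shift in the output MeSH labels can look like, the sketch below compares the label distributions of two publication years using the Jensen-Shannon divergence. This is not the measurement procedure or the TaCEL method from the paper: the function names, the choice of divergence, and the four toy records are all assumptions made up for this example.

```python
import math
from collections import Counter


def label_distribution(records, year):
    """Relative frequency of each label among the records of one year."""
    counts = Counter(
        label for rec_year, labels in records if rec_year == year for label in labels
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of labels.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical toy records: (publication year, assigned labels). Invented data.
records = [
    (2000, ["Humans", "Diabetes Mellitus"]),
    (2000, ["Humans", "Insulin"]),
    (2010, ["Humans", "Microbiota"]),
    (2010, ["Microbiota", "Insulin"]),
]

p2000 = label_distribution(records, 2000)
p2010 = label_distribution(records, 2010)
shift = js_divergence(p2000, p2010)  # larger value = more label shift
same = js_divergence(p2000, p2000)   # identical distributions give zero
print(f"JS divergence 2000 vs 2010: {shift:.3f}")
```

With base-2 logarithms the divergence lies between 0 (identical label distributions) and 1 (disjoint label sets), so values for different year pairs are directly comparable. Publication year is used as the time unit here because, per the paper's notes, year is the minimum unit in the MeSH indexing dataset.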

Notes

1. https://www.nlm.nih.gov/mesh/.
2. In this paper, we use “abstracts” and “articles” interchangeably.
3. Since year is the minimum unit in the MeSH indexing dataset.
4. http://participants-area.bioasq.org/general_information/Task7a/.
5. https://meshb.nlm.nih.gov/record/ui?ui=D008969.
6. We didn’t compare with the BioASQ challenge results for several reasons: (1) labels of the challenge test sets are not publicly available; (2) submitted results are generated by model ensembles. In the experiments, we use the challenge winner system, MeSHProbeNet [24], as a strong baseline (i.e. the Backbone Model).

References

1. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium, p. 17. American Medical Informatics Association (2001)
2. Brzeziński, D., Stefanowski, J.: Accuracy updated ensemble for data streams with concept drift. In: Corchado, E., Kurzyński, M., Woźniak, M. (eds.) HAIS 2011. LNCS (LNAI), vol. 6679, pp. 155–163. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21222-2_19
3. Costa, J., Silva, C., Antunes, M., Ribeiro, B.: Concept drift awareness in twitter streams. In: 2014 13th International Conference on Machine Learning and Applications, pp. 294–299. IEEE (2014)
4. Delany, S.J., Cunningham, P., Tsymbal, A., Coyle, L.: A case-based technique for tracking concept drift in spam filtering. In: Macintosh, A., Ellis, R., Allen, T. (eds.) SGAI 2004, pp. 3–16. Springer, London (2004). https://doi.org/10.1007/1-84628-103-2_1
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
7. Jin, Q., Dhingra, B., Cohen, W., Lu, X.: AttentionMeSH: simple, effective and interpretable automatic mesh indexer. In: Proceedings of the 6th BioASQ Workshop A Challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, pp. 47–56 (2018)
8. Karaa, A., Goldstein, A.: The spectrum of clinical presentation, diagnosis, and management of mitochondrial forms of diabetes. Pediatr. Diab. 16(1), 1–9 (2015)
9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
10. Koren, Y.: Collaborative filtering with temporal dynamics. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 447–456. ACM (2009)
11. Krawczyk, B., Minku, L.L., Gama, J., Stefanowski, J., Woźniak, M.: Ensemble learning for data stream analysis: a survey. Inf. Fusion 37, 132–156 (2017)
12. Moen, S., Ananiadou, T.S.S.: Distributional semantics resources for biomedical text processing. In: Proceedings of LBM (2013)
13. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 807–814 (2010)
14. Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8024–8035 (2019)
15. Peng, S., You, R., Wang, H., Zhai, C., Mamitsuka, H., Zhu, S.: DeepMeSH: deep semantic representation for improving large-scale mesh indexing. Bioinformatics 32(12), i70–i79 (2016)
16. Peters, M.E., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
17. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. The MIT Press, Cambridge (2009)
18. Street, W.N., Kim, Y.: A streaming ensemble algorithm (sea) for large-scale classification. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 377–382. ACM (2001)
19. Sun, J., Li, H.: Dynamic financial distress prediction using instance selection for the disposal of concept drift. Expert Syst. Appl. 38(3), 2566–2576 (2011)
20. Tsatsaronis, G., et al.: An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinform. 16(1), 138 (2015)
21. Tsymbal, A.: The problem of concept drift: definitions and related work. Comput. Sci. Dep. Trinity College Dublin 106(2), 58 (2004)
22. Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226–235. ACM (2003)
23. Woźniak, M., Graña, M., Corchado, E.: A survey of multiple classifier systems as hybrid systems. Inf. Fusion 16, 3–17 (2014)
24. Xun, G., Jha, K., Yuan, Y., Wang, Y., Zhang, A.: MeSHProbeNet: a self-attentive probe net for mesh indexing. Bioinformatics 35, 3794–3802 (2019)
25. Yao, Y., Rosasco, L., Caponnetto, A.: On early stopping in gradient descent learning. Constr. Approximation 26(2), 289–315 (2007)

Author information

Authors and Affiliations

Yidu Cloud Technology Inc., Beijing, China
Qiao Jin, Haoyang Ding, Linfeng Li, Lei Wang & Jun Yan
Institute of Information Science, Beijing Jiaotong University, Beijing, China
Linfeng Li
The People’s Hospital of Liaoning Province, Shenyang, Liaoning, China
Haitao Huang

Corresponding author

Correspondence to Lei Wang.

Editor information

Editors and Affiliations

Dankook University, Yongin, Korea (Republic of)
Yunmook Nah
Peking University, Haidian, China
Bin Cui
Sungkyunkwan University, Suwon, Korea (Republic of)
Sang-Won Lee
Department of Systems Engineering and En, The Chinese University of Hong Kong, Hong Kong, Hong Kong
Jeffrey Xu Yu
Kangwon National University, Chunchon, Korea (Republic of)
Yang-Sae Moon
Korea Advanced Institute of Science and, Daejeon, Korea (Republic of)
Steven Euijong Whang

About this paper

Cite this paper

Jin, Q., Ding, H., Li, L., Huang, H., Wang, L., Yan, J. (2020). Tackling MeSH Indexing Dataset Shift with Time-Aware Concept Embedding Learning. In: Nah, Y., Cui, B., Lee, SW., Yu, J.X., Moon, YS., Whang, S.E. (eds) Database Systems for Advanced Applications. DASFAA 2020. Lecture Notes in Computer Science, vol 12114. Springer, Cham. https://doi.org/10.1007/978-3-030-59419-0_29

DOI: https://doi.org/10.1007/978-3-030-59419-0_29
Published: 22 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59418-3
Online ISBN: 978-3-030-59419-0
eBook Packages: Computer Science, Computer Science (R0)
