Automated Metadata Suggestion During Repository Submission
Knowledge discovery via an informatics resource is constrained by the completeness of the resource, both in the amount of data it contains and in the metadata that describes that data. Increasing completeness in one of these dimensions risks reducing it in the other, because manually curating metadata is time-consuming and requires familiarity with both the data and the metadata annotation scheme. The diverse interests of a research community may drive a resource to have hundreds of metadata tags with few examples of each, making it challenging for humans or machine learning algorithms to learn how to assign metadata tags properly. Using ModelDB, a computational neuroscience model discovery resource, we demonstrate that manually curated, regular-expression-based rules can overcome this challenge: texts supplied by data providers are parsed during data entry to suggest metadata annotations, and providers are prompted to suggest other related annotations, rather than leaving the task to a curator. In the ModelDB implementation, analyzing the abstract identified 6.4 metadata tags per abstract at 79% precision. Using the full text produced higher recall but low precision (41%), and the title alone produced few (1.3) metadata annotations per entry; we therefore recommend that data providers use their abstract during upload. Grouping the possible metadata annotations into categories (e.g., cell type, biological topic) revealed that precision and recall for the different text sources vary by category. Given this proof of concept, other bioinformatics resources can likewise improve the quality of their metadata by adopting our approach of prompting data uploaders with relevant metadata, at the minimal cost of formalizing rules for each potential metadata annotation.
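The core mechanism described in the abstract can be sketched as a dictionary mapping metadata tags to curated regular expressions, applied to the text a data provider submits. The tag names and patterns below are illustrative assumptions, not ModelDB's actual rule set:

```python
import re

# Hypothetical rule set: each candidate metadata tag is paired with a
# manually curated regular expression (patterns here are illustrative only).
RULES = {
    "Biological topic: long-term potentiation": re.compile(r"\blong[- ]term potentiation\b|\bLTP\b"),
    "Cell type: hippocampal CA1 pyramidal neuron": re.compile(r"\bCA1\s+pyramidal\b", re.IGNORECASE),
    "Channel/receptor: NMDA": re.compile(r"\bNMDA\b"),
}

def suggest_tags(text):
    """Return the metadata tags whose rule matches the submitted text."""
    return sorted(tag for tag, pattern in RULES.items() if pattern.search(text))

# Example: suggestions drawn from an abstract supplied at upload time.
abstract = ("We model long-term potentiation in a CA1 pyramidal neuron "
            "driven by NMDA receptor activation.")
print(suggest_tags(abstract))
```

In a submission pipeline, the matched tags would be shown to the uploader as pre-checked suggestions to confirm or correct, which is where the precision/recall trade-off between title, abstract, and full text becomes relevant.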
Keywords: Metadata · Repository · Data sharing · Natural language processing
This study was funded by NIH grant R01 DC009977. We thank N. Ted Carnevale for valuable feedback on the manuscript.
Compliance with Ethical Standards
Conflict of Interest
The authors declare that they have no conflict of interest.