Text mining neuroscience journal articles to populate neuroscience databases

Abstract

We have developed a program, NeuroText, to populate the neuroscience databases in SenseLab (http://senselab.med.yale.edu/senselab) by mining the natural language text of neuroscience articles. NeuroText uses a two-step approach to identify relevant articles. The first step (pre-processing), aimed at 100% sensitivity, identifies abstracts containing database keywords. In the second step, the potentially relevant abstracts identified in the first step are processed for specificity, as dictated by the database architecture and by neuroscience, lexical, and semantic contexts. NeuroText results were presented to experts for validation through a dynamically generated interface that also allows expert-validated articles to be deposited automatically into the databases. Of the test set of 912 articles, 735 were rejected at the pre-processing step. For the remaining articles, the accuracy of predicting database-relevant articles was 85%. Twenty-two articles were erroneously identified, and NeuroText deferred decisions on 29 articles to the expert. A comparison of NeuroText results with the experts’ analyses revealed that the program failed to identify an article’s relevance correctly when the concepts involved did not yet exist in the knowledgebase or when the abstract presented information vaguely. NeuroText uses two “evolution” techniques (supervised and unsupervised) that play an important role in the continual improvement of the retrieval results. Software that uses the NeuroText approach can facilitate the creation of curated, special-interest bibliography databases.
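The two-step approach described in the abstract can be illustrated with a minimal Python sketch. It is not NeuroText's implementation: the keyword list, the context cues, the scoring rule, and the thresholds below are all hypothetical and chosen only to show a sensitive keyword filter followed by a more specific contextual check that may defer to an expert.

```python
# Hypothetical vocabulary; none of these terms or weights come from SenseLab.
KEYWORDS = {"sodium current", "mitral cell", "dopamine"}
SUPPORTING_CUES = {"modulation", "recorded", "inhibition"}
NEGATING_CUES = {"no effect", "not observed"}


def preprocess(abstract):
    """Step 1 (high sensitivity): return every database keyword in the abstract."""
    text = abstract.lower()
    return {keyword for keyword in KEYWORDS if keyword in text}


def classify(abstract, accept_at=2, reject_at=0):
    """Step 2 (specificity): score the context around the keyword hits.

    Returns "rejected" when step 1 finds no keyword, otherwise
    "relevant", "not relevant", or "deferred" (left to the expert).
    """
    hits = preprocess(abstract)
    if not hits:
        return "rejected"
    text = abstract.lower()
    supporting = sum(cue in text for cue in SUPPORTING_CUES)
    negating = sum(cue in text for cue in NEGATING_CUES)
    # Assumed scoring rule: negating context counts double.
    score = len(hits) + supporting - 2 * negating
    if score >= accept_at:
        return "relevant"
    if score <= reject_at:
        return "not relevant"
    return "deferred"


if __name__ == "__main__":
    for sample in (
        "Dopamine modulation of sodium current in hippocampal neurons.",
        "We studied cortical blood flow.",
        "Dopamine levels were measured.",
    ):
        print(classify(sample), "<-", sample)
```

Keeping the first step deliberately permissive and pushing all judgment into the second step mirrors the abstract's split between sensitivity and specificity; the "deferred" outcome stands in for the cases NeuroText hands to the expert.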

Author information

Corresponding author

Correspondence to Chiquito J. Crasto.

About this article

Cite this article

Crasto, C.J., Marenco, L.N., Migliore, M. et al. Text mining neuroscience journal articles to populate neuroscience databases. Neuroinform 1, 215–237 (2003). https://doi.org/10.1385/NI:1:3:215

Index Entries

  • Text mining
  • natural language processing
  • neuroscience
  • databases
  • supervised and unsupervised learning