Volume 1, Issue 3, pp 215–237

Text mining neuroscience journal articles to populate neuroscience databases

  • Chiquito J. Crasto (corresponding author)
  • Luis N. Marenco
  • Michele Migliore
  • Buqing Mao
  • Prakash M. Nadkarni
  • Perry Miller
  • Gordon M. Shepherd
Original Article


We have developed a program, NeuroText, to populate the neuroscience databases in SenseLab by mining the natural language text of neuroscience articles. NeuroText uses a two-step approach to identify relevant articles. The first step (pre-processing), aimed at 100% sensitivity, identifies abstracts containing database keywords. In the second step, the potentially relevant abstracts identified in the first step are processed for specificity dictated by database architecture and by neuroscience, lexical, and semantic contexts. NeuroText results were presented to experts for validation through a dynamically generated interface that also allows expert-validated articles to be deposited automatically into the databases. Of the test set of 912 articles, 735 were rejected at the pre-processing step, leaving 177 for the second step. For these remaining articles, the accuracy of predicting database-relevant articles was 85%. Twenty-two articles were erroneously identified, and NeuroText deferred decisions on 29 articles to the expert. A comparison of NeuroText results with the experts’ analyses revealed that the program failed to identify an article’s relevance correctly either because the concepts involved did not yet exist in the knowledgebase or because the abstract presented the information vaguely. NeuroText uses two “evolution” techniques (supervised and unsupervised) that play an important role in the continual improvement of the retrieval results. Software that uses the NeuroText approach can facilitate the creation of curated, special-interest bibliography databases.
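The two-step screening described above can be illustrated with a minimal Python sketch. It is not the NeuroText implementation: the keyword list, the context cues, the scoring thresholds, and the names `preprocess` and `classify` are all hypothetical stand-ins, since the abstract does not specify the knowledgebase or the rules used in the second step.

```python
import re
from dataclasses import dataclass, field

# Hypothetical stand-ins for database keywords and context cues;
# the actual NeuroText knowledgebase is not given in the abstract.
KEYWORDS = {"mitral cell", "sodium channel", "dentate gyrus"}
CONTEXT_CUES = {"current", "receptor", "synaptic"}


@dataclass
class Decision:
    label: str  # "rejected", "relevant", "not_relevant", or "deferred"
    matched: list = field(default_factory=list)


def preprocess(abstract, keywords=KEYWORDS):
    """Step 1: return every database keyword found in the abstract.

    Any single match keeps the abstract, which favours sensitivity.
    """
    text = abstract.lower()
    return sorted(k for k in keywords if k in text)


def classify(abstract):
    """Step 2: score surviving abstracts with a simple context check.

    Assumed rule (illustrative only): two or more cues -> relevant,
    exactly one cue -> defer to the expert, none -> not relevant.
    """
    matched = preprocess(abstract)
    if not matched:
        return Decision("rejected")
    words = set(re.findall(r"[a-z0-9+]+", abstract.lower()))
    cues = words & CONTEXT_CUES
    if len(cues) >= 2:
        return Decision("relevant", matched)
    if len(cues) == 1:
        return Decision("deferred", matched)
    return Decision("not_relevant", matched)
```

Calling `classify` on each abstract yields one of four labels. The "deferred" label mirrors the cases in which NeuroText leaves the decision to the expert, although the criterion used here for deferral is an assumption.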

Index Entries

Text mining; natural language processing; neuroscience databases; supervised and unsupervised learning





Copyright information

© Humana Press Inc 2003

Authors and Affiliations

  • Chiquito J. Crasto (1, 2), corresponding author
  • Luis N. Marenco (1)
  • Michele Migliore (2, 5)
  • Buqing Mao (1)
  • Prakash M. Nadkarni (1)
  • Perry Miller (1, 3, 4)
  • Gordon M. Shepherd (2)

  1. Center for Medical Informatics, Yale University, New Haven
  2. Department of Neurobiology, Yale University, New Haven
  3. Department of Anesthesiology, Yale University, New Haven
  4. Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven
  5. Institute of Biophysics, National Research Council, Palermo, Italy
