Mining Biomedical Text towards Building a Quantitative Food-Disease-Gene Network

  • Hui Yang
  • Rajesh Swaminathan
  • Abhishek Sharma
  • Vilas Ketkar
  • Jason D'Silva
Part of the Studies in Computational Intelligence book series (SCI, volume 375)


Advances in biotechnology and the life sciences are producing an ever-increasing volume of published research data, predominantly as unstructured text. Text mining techniques have been used to uncover the knowledge hidden in such data. Past and current efforts in this area have largely focused on recognizing gene and protein names and on identifying binary relationships among genes or proteins. In this chapter, we present an information extraction system that analyzes publications in Nutritional Genomics, an emerging discipline that studies the interactions among genes, foods and diseases, with the aim of building a quantitative food-disease-gene network. To this end, we adopt a host of techniques, including natural language processing (NLP), domain ontologies, and machine learning.

Specifically, the proposed system is composed of four main modules: (1) named entity recognition, which extracts five types of entities: foods, chemicals, diseases, proteins and genes; (2) relationship extraction, which uses a verb-centric approach to extract binary relationships between two entities; (3) relationship polarity and strength analysis, for which we have constructed novel features that capture the syntactic, semantic and structural aspects of a relationship, and in which a 2-phase Support Vector Machine classifies the polarity while a Support Vector Regression learner rates the strength level of a relationship; and (4) relationship integration and visualization, which integrates the previously extracted relationships and provides a preliminary user interface for intuitive observation and exploration.
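As a rough illustration of the data that flows between these modules, the sketch below models an extracted relationship as a record holding two typed entities, the connecting verb, a polarity label and a strength score, and then groups such records by entity pair as a stand-in for the integration step. All names (`Entity`, `Relationship`, `integrate`), the polarity labels and the averaging rule are assumptions made for illustration; the abstract does not specify them.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The five entity types named in the abstract.
ENTITY_TYPES = {"food", "chemical", "disease", "protein", "gene"}


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # one of ENTITY_TYPES

    def __post_init__(self):
        if self.kind not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.kind}")


@dataclass(frozen=True)
class Relationship:
    source: Entity
    target: Entity
    verb: str        # verb linking the two entities
    polarity: str    # hypothetical labels, e.g. "positive", "negative", "neutral"
    strength: float  # hypothetical numeric strength score


def integrate(relationships):
    """Group relationships by (source, target) and summarise each edge.

    Averaging the strength and counting polarity labels is an assumed
    aggregation rule, not necessarily the one used in the chapter.
    """
    grouped = defaultdict(list)
    for rel in relationships:
        grouped[(rel.source, rel.target)].append(rel)

    network = {}
    for pair, rels in grouped.items():
        polarity_counts = defaultdict(int)
        for rel in rels:
            polarity_counts[rel.polarity] += 1
        network[pair] = {
            "support": len(rels),
            "mean_strength": mean(rel.strength for rel in rels),
            "polarity_counts": dict(polarity_counts),
        }
    return network
```

Keying the network on entity pairs keeps every sentence-level extraction available as evidence for an edge, which is one plausible way to make the resulting network quantitative.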

Empirical evaluations of the first three modules demonstrate the efficacy of this system. The entity recognition module achieved balanced precision and recall, with an average F-score of 0.89. The average F-score of the extracted relationships is 0.905. Finally, accuracies of 0.91 and 0.96 were achieved in classifying the relationship polarity and rating the relationship strength level, respectively.
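For reference, an F-score is conventionally the harmonic mean of precision and recall (the balanced F1 measure). The abstract does not state which variant was used, so the sketch below assumes F1; the function name and its count-based inputs are illustrative choices.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) computed from raw counts.

    F1 is the harmonic mean of precision and recall. A ratio whose
    denominator is zero is reported as 0.0.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With made-up counts of 90 true positives, 10 false positives and 10 false negatives, precision and recall are both 0.9, so the F1 score is also 0.9 up to floating-point rounding.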


Support Vector Regression, Noun Phrase, Opinion Mining, Parse Tree, Name Entity Recognition
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Hui Yang
    • 1
  • Rajesh Swaminathan
    • 1
  • Abhishek Sharma
    • 1
  • Vilas Ketkar
    • 1
  • Jason D'Silva
    • 1
  1. 1. Department of Computer Science, San Francisco State University, USA
