Mining Protein Interactions from Text Using Convolution Kernels

  • Conference paper
New Frontiers in Applied Data Mining (PAKDD 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5669)

Abstract

As biomedical literature databases grow, there is an urgent need for intelligent systems that automatically discover protein-protein interactions from text. Despite resource-intensive efforts to create manually curated interaction databases, the sheer volume of biological literature makes it impossible to achieve significant coverage. In this paper, we describe a scalable, hierarchical Support Vector Machine (SVM)-based framework that efficiently mines protein interactions with high precision. In addition, we describe a convolution tree-vector kernel, based on the syntactic similarity of natural language text, that further enhances the mining process. By exploiting the inherent syntactic similarity of interaction phrases through a kernel method, we are able to significantly improve classification quality. Our hierarchical framework reduces the search space dramatically at each stage while sustaining a high level of accuracy. We test our framework on a corpus of over 10,000 manually annotated phrases gathered from various sources. The convolution kernel technique identifies sentences describing interactions with a precision of 95% and a recall of 92%, a significant improvement over previous machine learning techniques.
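To make the idea of a kernel over syntactic structure concrete, the following is a minimal sketch of the classic subset-tree convolution kernel of Collins and Duffy (NIPS 2001), which counts the tree fragments two parse trees share. It is not the tree-vector kernel of this paper, whose exact definition the abstract does not give. The tuple-based tree encoding, the function names, the decay parameter `lam`, and the toy parse trees are all assumptions made for illustration.

```python
from functools import lru_cache

# Hypothetical encoding: a node is (label, child, ...); a leaf word is a str.
# Words are assumed to appear only directly under preterminal (POS) nodes.

def production(node):
    """Return the grammar production at a node, e.g. ('VP', ('VBZ', 'NP'))."""
    label, *children = node
    return (label, tuple(c if isinstance(c, str) else c[0] for c in children))

def is_preterminal(node):
    """A preterminal node dominates only words."""
    return all(isinstance(c, str) for c in node[1:])

def nodes(tree):
    """Yield every non-leaf node of the tree."""
    yield tree
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from nodes(child)

def tree_kernel(t1, t2, lam=1.0):
    """Weighted count of tree fragments common to t1 and t2."""
    @lru_cache(maxsize=None)
    def common(n1, n2):
        # Fragments rooted at n1 and n2 can match only if productions agree.
        if production(n1) != production(n2):
            return 0.0
        if is_preterminal(n1):
            return lam
        score = lam
        for c1, c2 in zip(n1[1:], n2[1:]):
            if isinstance(c1, str) or isinstance(c2, str):
                continue  # words carry no sub-fragments of their own
            score *= 1.0 + common(c1, c2)
        return score

    return sum(common(a, b) for a in nodes(t1) for b in nodes(t2))

# Toy parse trees for two interaction sentences (illustrative only).
t1 = ("S", ("NP", ("NN", "RAD51")),
           ("VP", ("VBZ", "binds"), ("NP", ("NN", "BRCA2"))))
t2 = ("S", ("NP", ("NN", "p53")),
           ("VP", ("VBZ", "binds"), ("NP", ("NN", "MDM2"))))

if __name__ == "__main__":
    print(tree_kernel(t1, t2), tree_kernel(t1, t1))
```

The two toy sentences name different proteins but share the same syntactic frame, so their kernel value stays high relative to the self-similarity of either tree. In practice such a kernel is usually normalized, for example as K(a, b) / sqrt(K(a, a) K(b, b)), before being passed to an SVM.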

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Narayanan, R., Misra, S., Lin, S., Choudhary, A. (2010). Mining Protein Interactions from Text Using Convolution Kernels. In: Theeramunkong, T., et al. New Frontiers in Applied Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5669. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14640-4_9

  • DOI: https://doi.org/10.1007/978-3-642-14640-4_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14639-8

  • Online ISBN: 978-3-642-14640-4

  • eBook Packages: Computer Science, Computer Science (R0)