Mining Protein Interactions from Text Using Convolution Kernels

Narayanan, Ramanathan; Misra, Sanchit; Lin, Simon; Choudhary, Alok

doi:10.1007/978-3-642-14640-4_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5669)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

As biomedical literature databases grow, there is an urgent need for intelligent systems that automatically discover protein-protein interactions from text. Despite resource-intensive efforts to create manually curated interaction databases, the sheer volume of biological literature makes it impossible to achieve significant coverage. In this paper, we describe a scalable, hierarchical Support Vector Machine (SVM)-based framework to efficiently mine protein interactions with high precision. In addition, we describe a convolution tree-vector kernel based on the syntactic similarity of natural language text to further enhance the mining process. By using the inherent syntactic similarity of interaction phrases as a kernel method, we are able to significantly improve the classification quality. Our hierarchical framework allows us to reduce the search space dramatically with each stage, while sustaining a high level of accuracy. We test our framework on a corpus of over 10,000 manually annotated phrases gathered from various sources. The convolution kernel technique identifies sentences describing interactions with a precision of 95% and a recall of 92%, yielding significant improvements over previous machine learning techniques.
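
The abstract does not spell out how the tree-vector kernel is built, so the following is only a minimal sketch of the general idea behind a convolution kernel over parse trees: the similarity of two sentences is the (optionally decay-weighted) number of tree fragments their parse trees share, in the style of the subset-tree kernel of Collins and Duffy. The tuple encoding of trees, the function names (`tree_kernel`, `common_subtrees`, `normalized_tree_kernel`) and the `decay` parameter are assumptions made for illustration and are not taken from the paper.

```python
import math

# Assumed encoding: an internal node is a tuple (label, child, child, ...);
# a leaf (word) is a plain string.

def production(node):
    """Grammar production at a node: its label plus the labels of its children."""
    label, *children = node
    return (label,) + tuple(
        ("word", c) if isinstance(c, str) else ("node", c[0]) for c in children
    )

def internal_nodes(tree):
    """Yield every non-leaf node of the tree."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree[1:]:
        yield from internal_nodes(child)

def common_subtrees(n1, n2, decay=1.0):
    """Decay-weighted count of tree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = decay
    # Productions match, so children line up pairwise and have the same kind.
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str):
            score *= 1.0 + common_subtrees(c1, c2, decay)
    return score

def tree_kernel(t1, t2, decay=1.0):
    """Sum the common-fragment counts over all pairs of internal nodes."""
    return sum(
        common_subtrees(a, b, decay)
        for a in internal_nodes(t1)
        for b in internal_nodes(t2)
    )

def normalized_tree_kernel(t1, t2, decay=1.0):
    """Scale the kernel to [0, 1] so that tree size does not dominate."""
    denom = math.sqrt(tree_kernel(t1, t1, decay) * tree_kernel(t2, t2, decay))
    return tree_kernel(t1, t2, decay) / denom if denom else 0.0

# Hypothetical parse trees for two short interaction phrases.
s1 = ("S", ("NP", ("N", "RAD51")),
      ("VP", ("V", "binds"), ("NP", ("N", "BRCA2"))))
s2 = ("S", ("NP", ("N", "p53")),
      ("VP", ("V", "binds"), ("NP", ("N", "MDM2"))))
similarity = normalized_tree_kernel(s1, s2)
```

A kernel of this kind could, in principle, be evaluated over all pairs of training sentences to form a Gram matrix and passed to an SVM that accepts precomputed kernels. How the paper combines the tree component with a feature vector, and how the hierarchical stages are arranged, is not stated in the abstract and is not modelled here.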

Author information

Authors and Affiliations

Department of Electrical Engineering and Computer Science, Northwestern University,
Ramanathan Narayanan, Sanchit Misra & Alok Choudhary
Feinberg School of Medicine, Northwestern University,
Simon Lin

Editor information

Editors and Affiliations

Thammasat University, Sirindhorn International Institute of Technology, 131 Moo 5 Tiwanont Road, Bangkadi, 12000, Muang, Pathumthani, Thailand
Thanaruk Theeramunkong
Department of Architecture for Intelligence, The Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, 567-0047, Osaka, Japan
Cholwich Nattee
Center for Informatics, Federal University of Pernambuco, Brazil
Paulo J. L. Adeodato
Computer Science and Engineering Department, University of Notre Dame, 353 Fitzpatrick Hall, 46556, Notre Dame, IN, USA
Nitesh Chawla
Department of Computer Science, The Australian National University, Australia
Peter Christen
TELECOM Bretagne, Lab-STICC, Institut TELECOM, Brest, France
Philippe Lenca
School of Information Technologies, University of Sydney, P.O. Box, Australia
Josiah Poon
Australian Taxation Office, Australia
Graham Williams

About this paper

Cite this paper

Narayanan, R., Misra, S., Lin, S., Choudhary, A. (2010). Mining Protein Interactions from Text Using Convolution Kernels. In: Theeramunkong, T., et al. New Frontiers in Applied Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5669. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14640-4_9

DOI: https://doi.org/10.1007/978-3-642-14640-4_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14639-8
Online ISBN: 978-3-642-14640-4
eBook Packages: Computer Science, Computer Science (R0)
