Semi-supervised Prediction of Protein Interaction Sentences Exploiting Semantically Encoded Metrics

Polajnar, Tamara; Girolami, Mark

doi:10.1007/978-3-642-04031-3_24

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5780)

Included in the following conference series:

IAPR International Conference on Pattern Recognition in Bioinformatics

Abstract

Protein-protein interaction (PPI) identification is an integral component of many biomedical research and database curation tools. Automation of this task through classification is one of the key goals of text mining (TM). However, labelled PPI corpora required to train classifiers are generally small. In order to overcome this sparsity in the training data, we propose a novel method of integrating corpora that do not contain relevance judgements. Our approach uses a semantic language model to gather word similarity from a large unlabelled corpus. This additional information is integrated into the sentence classification process using kernel transformations and has a re-weighting effect on the training features that leads to an 8% improvement in F-score over the baseline results. Furthermore, we discover that some words which are generally considered indicative of interactions are actually neutralised by this process.
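
The re-weighting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the semantic language model, the kernel transformation, or the classifier, so the sketch assumes simple windowed co-occurrence counts as the word model, cosine similarity between word vectors, and a kernel of the form K = X S Xᵀ. The function names and the toy corpora are hypothetical.

```python
import numpy as np

def cooccurrence_matrix(sentences, vocab, window=2):
    """Count co-occurrences of vocabulary words at most `window` positions apart.

    Assumption: out-of-vocabulary tokens are dropped before windowing.
    """
    index = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for tokens in sentences:
        ids = [index[t] for t in tokens if t in index]
        for pos, i in enumerate(ids):
            for j in ids[max(0, pos - window):pos]:
                C[i, j] += 1.0
                C[j, i] += 1.0
    return C

def word_similarity(C):
    """Cosine similarity between co-occurrence rows (symmetric, positive semi-definite)."""
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # words with no co-occurrences keep an all-zero row
    N = C / norms
    return N @ N.T

def bag_of_words(sentences, vocab):
    """Term-count matrix with one row per sentence."""
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(sentences), len(vocab)))
    for row, tokens in enumerate(sentences):
        for t in tokens:
            if t in index:
                X[row, index[t]] += 1.0
    return X

def semantic_kernel(X, S):
    """Sentence kernel K = X S X^T: word similarity re-weights the features."""
    return X @ S @ X.T

# Hypothetical toy data: a larger unlabelled corpus and a small labelled one.
unlabelled = [
    ["protein", "binds", "receptor"],
    ["protein", "interacts", "receptor"],
    ["kinase", "binds", "substrate"],
    ["kinase", "interacts", "substrate"],
]
labelled = [
    ["protein", "binds", "receptor"],
    ["kinase", "interacts", "substrate"],
]
vocab = sorted({t for s in unlabelled for t in s})

S = word_similarity(cooccurrence_matrix(unlabelled, vocab))
X = bag_of_words(labelled, vocab)
K_lin = X @ X.T                # plain bag-of-words kernel
K_sem = semantic_kernel(X, S)  # kernel informed by the unlabelled corpus

print(K_lin)
print(K_sem)
```

In this toy setting the two labelled sentences share no words, so the plain kernel gives them zero similarity, while the semantic kernel links them through words that occur in similar contexts in the unlabelled text. A kernel matrix of this kind could be passed to any classifier that accepts a precomputed kernel.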

Author information

Authors and Affiliations

University of Glasgow, Glasgow, Scotland, G12 8QQ
Tamara Polajnar & Mark Girolami

Editor information

Editors and Affiliations

Department of Automatic Control and Systems Engineering, University of Sheffield, Mappin Street, S1 3JD, Sheffield, UK
Visakan Kadirkamanathan
Department of Computer Science and Department of Chemical and Process Engineering, University of Sheffield, Mappin Street, S1 3JD, Sheffield, UK
Guido Sanguinetti
University of Glasgow, Department of Computing Science, Sir Alwyn Williams Building, Lilybank Gardens, Glasgow, G12 8QQ, UK, and, University of Glasgow, Department of Statistics, 14 University Gardens, Glasgow, G12 8QQ, UK
Mark Girolami
School of Electronics and Computer Science, University of Southampton, SO17 1BJ, Southampton, UK
Mahesan Niranjan
Department of Chemical and Process Engineering, University of Sheffield, Mappin Street, S1 3JD, Sheffield, UK
Josselin Noirel

About this paper

Cite this paper

Polajnar, T., Girolami, M. (2009). Semi-supervised Prediction of Protein Interaction Sentences Exploiting Semantically Encoded Metrics. In: Kadirkamanathan, V., Sanguinetti, G., Girolami, M., Niranjan, M., Noirel, J. (eds) Pattern Recognition in Bioinformatics. PRIB 2009. Lecture Notes in Computer Science, vol 5780. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04031-3_24

DOI: https://doi.org/10.1007/978-3-642-04031-3_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04030-6
Online ISBN: 978-3-642-04031-3
eBook Packages: Computer Science, Computer Science (R0)
