Prioritizing Literature Search Results Using a Training Set of Classified Documents

  • Sérgio Matos
  • José Luis Oliveira
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 93)


Finding relevant articles is becoming an increasingly demanding task for researchers in the biomedical field, owing to the rapid expansion of the scientific literature. We investigate ranking strategies for prioritizing literature search results given an initial topic of interest. Focusing on protein-protein interactions (PPI), we compare ranking strategies based on different classifiers and features. The best result obtained on the BioCreative III PPI test set was an area under the interpolated precision-recall curve of 0.629. We then analyze the use of this method for ranking the results of PubMed queries. The results indicate that this strategy can be used by database curators to prioritize articles for the extraction of protein-protein interactions, and by researchers in general who are looking for publications that describe protein-protein interactions within a particular area of interest.
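The evaluation measure named above, the area under the interpolated precision-recall curve, can be illustrated with a minimal sketch. It assumes one common definition (precision is recorded at each relevant document in the ranking, then replaced by the maximum precision at that or any higher recall level), which may differ in detail from the official BioCreative III evaluation script. The function names and the toy scores are hypothetical and are not taken from the chapter.

```python
from typing import List, Sequence


def rank_labels_by_score(labels: Sequence[int], scores: Sequence[float]) -> List[int]:
    """Order the gold labels by descending classifier score (hypothetical helper)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def auc_interpolated_pr(ranked_labels: Sequence[int]) -> float:
    """Area under the interpolated precision-recall curve of a ranked list.

    ranked_labels holds 1 for a relevant document and 0 otherwise,
    in the order produced by the ranking strategy.
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0

    # Precision measured at the rank of each relevant document.
    precisions: List[float] = []
    hits = 0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)

    # Interpolation: each precision becomes the maximum precision
    # observed at the same or any higher recall level.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Each relevant document advances recall by 1 / total_relevant.
    return sum(precisions) / total_relevant


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    gold = [1, 0, 0, 1]
    scores = [0.9, 0.2, 0.7, 0.4]
    ranked = rank_labels_by_score(gold, scores)
    print(ranked, auc_interpolated_pr(ranked))
```

A ranking that places every relevant article ahead of every irrelevant one scores 1.0 under this definition, so the value summarizes how much of the relevant material a curator would meet early in the list.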


Information Retrieval; Biomedical Literature; Protein-protein Interactions; Article Classification





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Sérgio Matos, Universidade de Aveiro, Aveiro, Portugal
  • José Luis Oliveira, Universidade de Aveiro, Aveiro, Portugal
