A Method of Extracting Sentences Containing Protein Function Information from Articles by Iterative Learning with Feature Update

  • Kazunori Miyanishi
  • Takenao Ohkawa
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7845)

Abstract

Proteins are important macromolecules in living systems and perform a wide range of functions in almost all biological processes. Information about protein function is reported in many scientific articles, and extracting it is useful for drug discovery, for understanding life phenomena, and for related purposes. However, extracting this information manually from a large number of articles is infeasible. In this paper, we propose a method of extracting sentences containing protein function information by iterative learning with feature update. The method uses a classifier to distinguish sentences containing function information from other sentences, and introduces a semi-automatic procedure in which a new classifier is constructed from the user’s feedback on the previous classification results. In an experiment with twelve articles as feedback data, the F-measure improved as learning was iterated, with no negative effect from the feedback.
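The abstract gives no implementation details, so the following is only a minimal sketch of the feedback loop it describes, under stated assumptions. A keyword-overlap classifier stands in for the decision tree named in the keywords, "feature update" is modeled as re-selecting keyword features after every feedback round, and all function names and example sentences are hypothetical.

```python
import re

def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def select_features(labeled):
    """Feature update (assumed form): keep words that occur in positive
    sentences and never in negative ones."""
    pos_words, neg_words = set(), set()
    for sentence, is_function in labeled:
        (pos_words if is_function else neg_words).update(tokenize(sentence))
    return pos_words - neg_words

def classify(sentence, features):
    """Stand-in classifier: positive if the sentence shares any feature word."""
    return bool(tokenize(sentence) & features)

def f_measure(predicted, gold):
    """Harmonic mean of precision and recall for boolean labels."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def iterative_learning(seed, feedback_batches, held_out):
    """Rebuild the classifier after each batch of user-corrected sentences
    and record the F-measure on held-out data after every round."""
    labeled = list(seed)
    features = select_features(labeled)
    gold = [label for _, label in held_out]
    scores = [f_measure([classify(s, features) for s, _ in held_out], gold)]
    for batch in feedback_batches:
        # Each batch is assumed to already carry the user's corrected labels.
        labeled.extend(batch)
        features = select_features(labeled)  # features are re-selected
        scores.append(
            f_measure([classify(s, features) for s, _ in held_out], gold)
        )
    return features, scores

# Hypothetical toy data: (sentence, contains_function_information)
SEED = [
    ("Thrombin catalyzes the cleavage of fibrinogen.", True),
    ("Crystals were grown at room temperature.", False),
]
FEEDBACK = [[
    ("The protein binds calcium ions.", True),
    ("The data were collected at 100 K.", False),
]]
HELD_OUT = [
    ("Thrombin binds fibrinogen.", True),
    ("The crystals were grown slowly.", False),
]

features, scores = iterative_learning(SEED, FEEDBACK, HELD_OUT)
```

In this toy run the seed data makes the uninformative word "the" a feature, which causes a false positive on the held-out data. The feedback round shows "the" in a negative sentence, so the re-selected feature set drops it and the held-out F-measure rises.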

Keywords

protein function information, information extraction, decision tree, iterative learning

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Kazunori Miyanishi (1)
  • Takenao Ohkawa (2)
  1. Graduate School of Science and Technology, Kobe University, Nada, Japan
  2. Graduate School of System Informatics, Kobe University, Nada, Japan
