PubMiner: Machine Learning-Based Text Mining System for Biomedical Information Mining

  • Jae-Hong Eom
  • Byoung-Tak Zhang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3192)


PubMiner, an intelligent machine learning-based text mining system for mining biological information from the literature, is introduced. PubMiner utilizes natural language processing and machine learning-based data mining techniques to mine useful biological information, such as protein-protein interactions, from the massive body of literature. The system recognizes biological terms such as genes, proteins, and enzymes, and extracts the interactions described in a document through natural language analysis. The extracted interactions are further analyzed with a set of features for each entity, constructed from related public databases, to infer additional interactions from the original ones. Both the inferred interactions and the native interactions are provided to the user with links to their literature sources. System performance was evaluated on the protein interaction data of S. cerevisiae (baker's yeast) from MIPS and SGD.
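The extraction step described above can be illustrated with a deliberately simplified sketch. It is not PubMiner's implementation: the abstract does not specify the recognizer or the parser, so the code below assumes a toy dictionary lookup for protein names and a small list of interaction verbs. The names `PROTEIN_LEXICON`, `INTERACTION_VERBS`, `split_sentences`, and `extract_interactions` are hypothetical and exist only for illustration.

```python
import re
from itertools import combinations

# Hypothetical toy lexicon (assumption): a stand-in for the NLP/ML-based
# term recognition that the abstract describes.
PROTEIN_LEXICON = {"Cdc28", "Cln2", "Sic1", "Swi4"}

# Hypothetical list of interaction verbs (assumption, not from the paper).
INTERACTION_VERBS = {"binds", "phosphorylates", "activates", "inhibits", "interacts"}


def split_sentences(text):
    """Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_interactions(text):
    """Return (protein_a, verb, protein_b, sentence) tuples.

    A pair is reported when two lexicon proteins appear in the same sentence
    together with at least one interaction verb. The first verb found in the
    sentence is used as the label, and each tuple keeps its source sentence
    so the result can be linked back to the literature.
    """
    results = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        proteins = sorted({t for t in tokens if t in PROTEIN_LEXICON})
        verbs = [t.lower() for t in tokens if t.lower() in INTERACTION_VERBS]
        if not verbs:
            continue
        for a, b in combinations(proteins, 2):
            results.append((a, verbs[0], b, sentence))
    return results


if __name__ == "__main__":
    sample = "Cdc28 binds Cln2 in late G1. Sic1 is degraded afterwards."
    for interaction in extract_interactions(sample):
        print(interaction)
```

A co-occurrence rule like this over-generates and ignores syntax, so it only shows the shape of the data flowing through such a system: entities in, interaction tuples with source sentences out. The inference step that uses database-derived features is not covered here.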


Keywords: Natural Language Processing, Data Mining, Machine Learning, Bioinformatics, Software Application





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Jae-Hong Eom (1)
  • Byoung-Tak Zhang (1)
  1. Biointelligence Lab., School of Computer Science and Engineering, Seoul National University, Seoul, South Korea
