MaxMatcher: Biological Concept Extraction Using Approximate Dictionary Lookup

  • Xiaohua Zhou
  • Xiaodan Zhang
  • Xiaohua Hu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4099)


Dictionary-based biological concept extraction remains the state-of-the-art approach to large-scale annotation and indexing of biomedical literature. Exact dictionary lookup is simple, but it achieves low extraction recall because a biological term often has many variants and no dictionary can collect all of them. We propose a generic extraction approach, referred to as approximate dictionary lookup, to cope with term variation, and implement it as an extraction system called MaxMatcher. The basic idea is to capture the words that are significant to a particular concept rather than all of its words. The new approach dramatically improves extraction recall while maintaining precision. In a comparative study on the GENIA corpus, the new approach reaches 57% recall, whereas exact dictionary lookup achieves only 26%.
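The contrast between exact and approximate lookup can be illustrated with a minimal Python sketch. It is not the MaxMatcher implementation: the abstract does not give the scoring formula, so the weighting used here (a word shared by N concepts gets weight 1/N), the 0.6 threshold, the toy dictionary, and all function names are assumptions made for illustration only. The sketch also scores a single candidate mention rather than scanning running text.

```python
from collections import defaultdict

# Toy dictionary (hypothetical entries): concept id -> known name variants.
DICTIONARY = {
    "C1": ["nuclear factor kappa b", "nf kappa b"],
    "C2": ["nuclear receptor"],
    "C3": ["transcription factor"],
}


def tokenize(text):
    """Lowercase, treat hyphens as spaces, split on whitespace."""
    return text.lower().replace("-", " ").split()


def build_weights(dictionary):
    """Assumed significance weighting: a word that occurs in the names of
    N different concepts gets weight 1/N, so rare words count more."""
    concepts_with_word = defaultdict(set)
    for cid, variants in dictionary.items():
        for variant in variants:
            for word in tokenize(variant):
                concepts_with_word[word].add(cid)
    return {w: 1.0 / len(cids) for w, cids in concepts_with_word.items()}


def variant_score(variant, mention_tokens, weights):
    """Fraction of the variant's total word weight found in the mention."""
    words = tokenize(variant)
    total = sum(weights[w] for w in words)
    matched = sum(weights[w] for w in words if w in mention_tokens)
    return matched / total


def approximate_lookup(mention, dictionary, weights, threshold=0.6):
    """Return concepts whose best-scoring variant reaches the threshold."""
    mention_tokens = set(tokenize(mention))
    hits = {}
    for cid, variants in dictionary.items():
        best = max(variant_score(v, mention_tokens, weights) for v in variants)
        if best >= threshold:
            hits[cid] = best
    return hits


def exact_lookup(mention, dictionary):
    """Return concepts that list the mention verbatim (after normalization)."""
    phrase = " ".join(tokenize(mention))
    return {
        cid
        for cid, variants in dictionary.items()
        if any(" ".join(tokenize(v)) == phrase for v in variants)
    }


weights = build_weights(DICTIONARY)
mention = "factor kappa-B"  # a variant that is not listed in the dictionary
approx_hits = approximate_lookup(mention, DICTIONARY, weights)
exact_hits = exact_lookup(mention, DICTIONARY)
```

With this toy data, the unlisted variant "factor kappa-B" is still resolved to C1, because the words it shares with the listed name carry most of that name's weight, while exact lookup finds nothing. Raising the threshold trades recall for precision in this sketch.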


Keywords: Significance Score; Boundary Word; Biological Concept; Biological Term; Approximate Match
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Xiaohua Zhou (1)
  • Xiaodan Zhang (1)
  • Xiaohua Hu (1)
  1. College of Information Science & Technology, Drexel University, Philadelphia, USA
