Automatic Extraction of Genomic Glossary Triggered by Query

  • Jiao Li
  • Xiaoyan Zhu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3916)


In genomic research, understanding a specific gene name is the entry point to most Information Retrieval (IR) and Information Extraction (IE) systems. In this paper we present an automatic method for extracting a genomic glossary triggered by the initial gene name in a query. Our system uses LocusLink gene names as query triggers and MEDLINE abstracts as the genomic corpus. We evaluate the extracted glossary through query expansion in the TREC 2003 Genomics Track ad hoc retrieval task, and the experimental results show that 90.15% recall can be achieved.
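The evaluation idea in the abstract (expand a gene-name query with glossary terms, then measure recall) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy documents, the one-entry glossary, and the boolean-OR matching over a simple inverted index are hypothetical and are not taken from the paper's system or data.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def expand_query(gene_name, glossary):
    """Return the gene name plus any glossary terms recorded for it."""
    return [gene_name] + glossary.get(gene_name, [])

def retrieve(terms, index):
    """Return ids of documents matching any single-token term (boolean OR)."""
    hits = set()
    for term in terms:
        hits |= index.get(term.lower(), set())
    return hits

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# Toy corpus and glossary (hypothetical, not from the paper).
docs = {
    1: "tp53 mutation in tumor cells",
    2: "p53 regulates apoptosis",
    3: "brca1 and dna repair",
}
glossary = {"tp53": ["p53"]}
relevant = {1, 2}

index = build_inverted_index(docs)
baseline = recall(retrieve(["tp53"], index), relevant)
expanded = recall(retrieve(expand_query("tp53", glossary), index), relevant)
print(baseline, expanded)  # 0.5 1.0
```

On this toy data the unexpanded query misses the document that uses only the alias, which is the kind of recall gap that query expansion with an extracted glossary is meant to close.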


Keywords: Query Term; Automatic Extraction; Retrieval Task; Unify Medical Language System; Inverted Index





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Jiao Li (1)
  • Xiaoyan Zhu (1)
  1. State Key Laboratory of Intelligent Technology and Systems (LITS), Department of Computer Science and Technology, Tsinghua University, Beijing, China
