Using Concept-Based Indexing to Improve Language Modeling Approach to Genomic IR

  • Xiaohua Zhou
  • Xiaodan Zhang
  • Xiaohua Hu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3936)


Genomic IR is challenging for the IR community: it is characterized by highly specific information needs, severe synonymy and polysemy problems, long term names, and a rapidly growing literature. In this paper, we focus on addressing the synonymy and polysemy issue within the language modeling framework. Unlike translation models and traditional query expansion techniques, we incorporate concept-based indexing into a basic language model for genomic IR. In particular, we adopt UMLS concepts as indexing and searching terms. A UMLS concept stands for a unique meaning in the biomedical domain; a set of synonymous terms shares the same concept ID. The new approach therefore makes document ranking effective while maintaining the simplicity of language models. A comparative experiment on the TREC 2004 Genomics Track data shows that incorporating concept-based indexing into a basic language model yields significant improvements: the MAP (mean average precision) rises from 29.17% (the baseline system) to 36.94%. The performance of the new approach is also significantly superior to the mean (21.72%) of the official runs that participated in the TREC 2004 Genomics Track and is comparable to that of the best run (40.75%). Most official runs, including the best run, make extensive use of query expansion and pseudo-relevance feedback techniques, whereas our approach does nothing beyond incorporating concept-based indexing. This supports the view that semantic smoothing, i.e., the incorporation of synonym and sense information into language models, is a more standard approach to achieving the effects that traditional query expansion and pseudo-relevance feedback techniques target.
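The idea of indexing by concept ID rather than by surface term can be illustrated with a minimal sketch. It is not the authors' implementation: the term-to-concept table and its IDs are made up for illustration (real UMLS concept identifiers would come from the UMLS Metathesaurus), whitespace tokenization stands in for proper term extraction, and Jelinek-Mercer smoothing is an assumption, since the abstract does not say which smoothing method the basic language model uses.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: surface term -> concept ID.
# These IDs are invented; they are NOT real UMLS identifiers.
TERM_TO_CONCEPT = {
    "tumor": "C0001",
    "neoplasm": "C0001",
    "tp53": "C0002",
    "p53": "C0002",
    "mouse": "C0003",
    "mice": "C0003",
}

def to_concepts(text):
    """Map each whitespace token to its concept ID; drop unmapped tokens."""
    tokens = text.lower().split()
    return [TERM_TO_CONCEPT[t] for t in tokens if t in TERM_TO_CONCEPT]

def score(query, doc, collection, lam=0.5):
    """Log query likelihood over concept IDs.

    Uses Jelinek-Mercer smoothing (an assumption of this sketch):
    p(c|d) = (1 - lam) * p_ml(c|d) + lam * p(c|collection).
    """
    doc_counts = Counter(to_concepts(doc))
    coll_counts = Counter(c for d in collection for c in to_concepts(d))
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    log_p = 0.0
    for c in to_concepts(query):
        p_doc = doc_counts[c] / doc_len if doc_len else 0.0
        p_coll = coll_counts[c] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Concept unseen in the whole collection: zero likelihood.
            return float("-inf")
        log_p += math.log(p)
    return log_p

docs = [
    "p53 neoplasm in mice",
    "mouse behaviour study",
]
query = "tp53 tumor"

# Rank documents by descending query likelihood.
ranked = sorted(docs, key=lambda d: score(query, d, docs), reverse=True)
print(ranked[0])
```

In this toy collection the query shares no surface word with the first document, yet that document ranks first because "tp53"/"p53" and "tumor"/"neoplasm" map to the same concept IDs. A word-based index would find no overlap at all.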


Keywords (added by machine, not by the authors): Language Model; Query Expansion; Mean Average Precision; Word Sense Disambiguation; Information Retrieval Model





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Xiaohua Zhou (1)
  • Xiaodan Zhang (1)
  • Xiaohua Hu (1)
  1. College of Information Science & Technology, Drexel University, Philadelphia, USA
