Information Retrieval

Volume 10, Issue 4–5, pp 341–363

An empirical study of tokenization strategies for biomedical information retrieval


Abstract

Because biological names vary greatly in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, few studies have evaluated tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on retrieval performance. As expected, our experimental results show that tokenization can significantly affect retrieval accuracy; appropriate tokenization can improve performance by up to 96%, measured by mean average precision (MAP). In particular, the results show that different query types require different tokenization heuristics, that stemming is effective only for certain queries, and that stop word removal in general does not improve retrieval performance on biomedical text.
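To make the two central notions in the abstract concrete, the following is a minimal Python sketch of how alternative tokenizations of a biological name can differ and how mean average precision (MAP) is computed from ranked results. The function names, the `split_hyphen` and `split_alnum` switches, and the example name "MIP-1-alpha" are illustrative assumptions. They are not the specific heuristics, queries, or retrieval toolkit evaluated in the article.

```python
import re


def tokenize(text, split_hyphen=True, split_alnum=False):
    """Lowercase and tokenize text under two hypothetical switches.

    split_hyphen: treat '-' as a separator (True) or delete it (False).
    split_alnum:  also break tokens at letter/digit boundaries.
    These switches are illustrative, not the article's heuristics.
    """
    text = text.lower()
    text = text.replace("-", " " if split_hyphen else "")
    # Every remaining run of non-alphanumeric characters is a separator.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    if split_alnum:
        # Insert a break wherever a letter meets a digit, in either order.
        text = re.sub(r"(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ", text)
    return text.split()


def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query; assumes ranked_ids has no duplicates."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """MAP over a list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    name = "MIP-1-alpha"
    print(tokenize(name))                                       # hyphen as separator
    print(tokenize(name, split_hyphen=False))                   # hyphen deleted
    print(tokenize(name, split_hyphen=False, split_alnum=True)) # plus letter/digit breaks
    runs = [(["d1", "d2", "d3"], {"d1", "d3"}), (["d2", "d1"], {"d1"})]
    print(mean_average_precision(runs))
```

Depending on the switches, the same name is indexed as one token or as several, which determines whether a query written in a different surface form can match it. This is the kind of variation whose effect on MAP the study measures.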

Keywords

Biomedical information retrieval, Tokenization, Stemming, Stop word


Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA
