Abstract
With the help of a team of expert biologist judges, the TREC Genomics track has generated four large “gold standard” test collections, comprising over a hundred unique topics, two kinds of ad hoc retrieval tasks, and the corresponding relevance judgments. Over the years of the track, increasingly complex tasks necessitated the creation of judging tools and training guidelines, both to accommodate teams of part-time, short-term workers from a variety of specialized backgrounds in the biological sciences and to address the consistency and reproducibility of the assessment process. Important lessons were learned about the factors that influence the utility of a test collection, including topic design, the annotations provided by judges, the methods used to identify and train judges, and the provision of a central moderating “meta-judge”.
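The consistency of the assessment process across judges is commonly quantified as inter-annotator agreement. As a minimal illustrative sketch only, not the track's actual tooling, the snippet below computes Cohen's kappa over two judges' binary relevance labels; the judge data here are invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges' labels on the same set of documents."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of documents both judges labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each judge's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical relevance judgments (1 = relevant, 0 = not relevant).
judge_1 = [1, 1, 0, 1, 0, 0, 1, 0]
judge_2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(f"kappa = {cohens_kappa(judge_1, judge_2):.3f}")  # kappa = 0.500
```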



Acknowledgements
The TREC Genomics Track was funded by grant ITR-0325160 to W.R.H. from the U.S. National Science Foundation. The authors would like to thank the Genomics track steering committee, especially Kevin Bretonnel Cohen and Anna Divoli, for helpful discussions about relevance judgments and guidelines.
Cite this article
Roberts, P.M., Cohen, A.M. & Hersh, W.R. Tasks, topics and relevance judging for the TREC Genomics Track: five years of experience evaluating biomedical text information retrieval systems. Inf Retrieval 12, 81–97 (2009). https://doi.org/10.1007/s10791-008-9072-x
Keywords
- Reference standards
- Evaluation
- Inter-annotator agreement
- Text mining
- Information retrieval