CLEF-2005 CL-SR at Maryland: Document and Query Expansion Using Side Collections and Thesauri

  • Jianqiang Wang
  • Douglas W. Oard
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4022)


Abstract

This paper reports results for the University of Maryland’s participation in the CLEF-2005 Cross-Language Speech Retrieval track. Four techniques were explored: (1) document expansion with manually created metadata (thesaurus keywords and segment summaries) from a large side collection, (2) query refinement with blind (pseudo-) relevance feedback, (3) keyword expansion with thesaurus synonyms, and (4) cross-language speech retrieval using translation knowledge derived from the statistics of a large parallel corpus. The results show that document expansion and query expansion using blind relevance feedback were effective, although optimal parameter choices differed somewhat between the training and evaluation sets. Document expansion in which manually assigned keywords were augmented with thesaurus synonyms yielded marginal gains on the training set but no improvement on the evaluation set. Cross-language retrieval with French queries achieved 79% of monolingual mean average precision when searching manually assigned metadata, despite a substantial domain mismatch between the parallel corpus and the retrieval task. Detailed failure analysis indicates that speech recognition errors on named entities substantially degraded retrieval effectiveness.
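To make the blind relevance feedback step concrete, the sketch below shows one common way to select expansion terms from the top-ranked documents of an initial retrieval pass: candidate terms are scored by how often they appear in the feedback documents, weighted by an idf-like factor so that very common terms are not promoted. The function, its parameters, and this particular term-selection heuristic are illustrative assumptions; they do not reproduce the exact formulation or parameter settings used in the paper's runs.

```python
from collections import Counter
import math


def blind_relevance_feedback(query_terms, ranked_docs,
                             num_feedback_docs=10, num_expansion_terms=5):
    """Expand a query with terms that are prominent in the top-ranked documents.

    query_terms: list of query term strings.
    ranked_docs: list of tokenized documents (lists of strings), ordered by
        the initial retrieval score (best first).
    Returns the original query terms plus the selected expansion terms.
    """
    feedback_docs = ranked_docs[:num_feedback_docs]
    n_docs = len(ranked_docs)

    # Document frequency of each term over the ranked set
    # (a stand-in for collection-wide statistics).
    df = Counter()
    for doc in ranked_docs:
        df.update(set(doc))

    # Score candidate terms: number of feedback documents containing the term,
    # each occurrence weighted by an idf-like factor.
    scores = Counter()
    existing = set(query_terms)
    for doc in feedback_docs:
        for term in set(doc):
            if term in existing:
                continue
            idf = math.log((n_docs + 1.0) / (df[term] + 0.5))
            scores[term] += idf

    expansion = [term for term, _ in scores.most_common(num_expansion_terms)]
    return list(query_terms) + expansion


if __name__ == "__main__":
    # Toy example with hypothetical tokenized segments.
    query = ["refugee", "rescue"]
    ranked = [
        ["refugee", "rescue", "budapest", "visa", "diplomat"],
        ["rescue", "visa", "diplomat", "sweden"],
        ["refugee", "train", "camp"],
        ["weather", "report", "crops"],
    ]
    print(blind_relevance_feedback(query, ranked,
                                   num_feedback_docs=2, num_expansion_terms=3))
```

The number of feedback documents and expansion terms are tuning parameters, which is consistent with the observation above that optimal parameter choices differed between the training and evaluation sets.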


Keywords: Average Precision, Automatic Speech Recognition, Query Expansion, Mean Average Precision, Test Collection





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Jianqiang Wang (1)
  • Douglas W. Oard (1)

  1. College of Information Studies and UMIACS, University of Maryland, College Park, USA
