Topic Detection and Tracking pp 33-66

Part of The Information Retrieval Series book series (INRE, volume 12)

Corpora for Topic Detection and Tracking

  • Christopher Cieri
  • Stephanie Strassel
  • David Graff
  • Nii Martey
  • Kara Rennert
  • Mark Liberman

Abstract

The TDT corpora, developed to support the DARPA-sponsored program in Topic Detection and Tracking, combine data collected over a nine-month period from eight English and three Chinese sources. The published corpora contain audio, reference text (including written news text and transcripts of the broadcast audio), boundary tables segmenting the broadcasts into stories, and relevance tables resulting from millions of human judgments. Sections of the corpora have undergone topic-story, first-story and story-link annotation. Both the TDT-2 and TDT-3 text corpora and the accompanying broadcast audio are now available from the Linguistic Data Consortium. This paper describes the raw material collected for the corpora, the annotation of that material to prepare it for research use, and the formats in which it is distributed. Special attention is paid to the quality control measures developed for these data sets.

Copyright information

© Springer Science+Business Media New York 2002

Authors and Affiliations

  • Christopher Cieri (1)
  • Stephanie Strassel (1)
  • David Graff (1)
  • Nii Martey (1)
  • Kara Rennert (1)
  • Mark Liberman (1)

  1. Linguistic Data Consortium, University of Pennsylvania, Philadelphia, USA
