Abstract
This paper describes a system for retrieving relevant portions of broadcast news shows starting with only the audio data. A novel method of automatically detecting and removing commercials is presented and shown to increase the performance of the system while also reducing the computational effort required. A large-vocabulary speech recogniser that produces high-quality transcriptions of the audio, and a window-based retrieval system with post-retrieval merging, are also described.
Results are presented using the 1999 TREC-8 Spoken Document Retrieval data for the task where no story boundaries are known. Experiments investigating the effectiveness of all aspects of the system are described, and the relative benefits of automatically eliminating commercials, enforcing broadcast structure during retrieval, using relevance feedback, changing retrieval parameters and merging during post-processing are shown.
The system is shown to achieve an Average Precision of 46.8% when duplicates are scored as irrelevant, with a corresponding recogniser word error rate of 20.5%.
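To illustrate what scoring duplicates as irrelevant means when story boundaries are unknown, the sketch below computes average precision over a ranked list of returned time points. It is a minimal illustration, not the official evaluation software: the function names, the (start, end, story_id) layout of the reference stories and the example data are all assumptions made here. Each returned time is mapped to the reference story containing it; a second or later hit in the same story, or a time falling outside every story (for example in a commercial), is counted as irrelevant.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical layout for reference stories: (start_sec, end_sec, story_id).
Story = Tuple[float, float, str]


def story_at(time_sec: float, stories: List[Story]) -> Optional[str]:
    """Return the id of the story containing time_sec, or None if no story does."""
    for start, end, story_id in stories:
        if start <= time_sec < end:
            return story_id
    return None


def average_precision(ranked_times: List[float],
                      stories: List[Story],
                      relevant: Set[str]) -> float:
    """Average precision with repeated hits in one story scored as irrelevant."""
    seen: Set[str] = set()
    hits = 0
    precision_sum = 0.0
    for rank, time_sec in enumerate(ranked_times, start=1):
        story_id = story_at(time_sec, stories)
        if story_id is None:
            continue  # outside every story: occupies a rank, scores nothing
        if story_id in relevant and story_id not in seen:
            hits += 1
            precision_sum += hits / rank
        seen.add(story_id)  # any later hit in this story is a duplicate
    return precision_sum / len(relevant) if relevant else 0.0


# Made-up example: three one-minute stories, two of them relevant.
stories = [(0.0, 60.0, "a"), (60.0, 120.0, "b"), (120.0, 180.0, "c")]
relevant = {"a", "c"}
ap_with_duplicate = average_precision([10.0, 30.0, 130.0, 70.0], stories, relevant)
ap_without_duplicate = average_precision([10.0, 130.0, 70.0], stories, relevant)
```

In this made-up example the second returned time falls in story "a" again, so it pushes story "c" down to rank 3 and lowers the score relative to the list without the duplicate. This is the reason a system working without story boundaries benefits from merging nearby windows before returning results.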
Cite this article
Johnson, S.E., Jourlin, P., Jones, K.S. et al. Information Retrieval from Unsegmented Broadcast News Audio. International Journal of Speech Technology 4, 251–268 (2001). https://doi.org/10.1023/A:1011312708732
DOI: https://doi.org/10.1023/A:1011312708732