Automated Speech and Audio Analysis for Semantic Access to Multimedia

  • Franciska de Jong
  • Roeland Ordelman
  • Marijn Huijbregts
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4306)


The deployment and integration of audio processing tools can enhance the semantic annotation of multimedia content and, as a consequence, improve the effectiveness of conceptual access tools. This paper gives an overview of the various ways in which automatic speech and audio analysis can contribute to increased granularity of automatically extracted metadata. A number of techniques are presented, including the alignment of speech and text resources, large-vocabulary speech recognition, keyword spotting and speaker classification. The applicability of these techniques is discussed from a cross-media perspective. Their added value and potential contribution to the content value chain are illustrated by the description of two complementary demonstrators for browsing broadcast news archives.


Keywords: Speech Recognition; Language Model; Automatic Speech Recognition; Acoustic Model; Speech Recognition System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Franciska de Jong (1, 2)
  • Roeland Ordelman (1)
  • Marijn Huijbregts (1)
  1. Dept. of Computer Science, University of Twente, Enschede, The Netherlands
  2. TNO-ICT, Delft, The Netherlands
