Segmentation and Annotation of Audiovisual Recordings Based on Automated Speech Recognition

  • Stephan Repp
  • Jörg Waitelonis
  • Harald Sack
  • Christoph Meinel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4881)


Searching multimedia data, in particular audiovisual data, is still a challenging task. The number of digital video recordings has increased dramatically as recording technology has become more affordable and network infrastructure has become good enough to support download and streaming solutions. However, the accessibility and traceability of their content for further use is still rather limited. In this paper we describe and evaluate a new approach to synchronizing auxiliary text-based material, e.g. presentation slides, with lecture video recordings. Our goal is to show that the tentative transliteration produced by automated speech recognition is sufficient for synchronization. Different approaches to synchronizing textual material with deficient transliterations of lecture recordings are discussed and evaluated. Our evaluation data set comprises recordings in different languages and by various speakers.


Keywords: Speech Recognition, Slide Transition, Automatic Speech Recognition, Text Segmentation, Portable Document Format
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Stephan Repp (1)
  • Jörg Waitelonis (2)
  • Harald Sack (2)
  • Christoph Meinel (1)

  1. Hasso-Plattner-Institut für Softwaresystemtechnik GmbH (HPI), P.O. Box 900460, D-14440 Potsdam, Germany
  2. Friedrich-Schiller-Universität Jena, Ernst-Abbe-Platz 2-4, D-07743 Jena, Germany
