
Language Resources and Evaluation, Volume 44, Issue 3, pp 205–219

WOZ acoustic data collection for interactive TV

  • Alessio Brutti
  • Luca Cristoforetti
  • Walter Kellermann
  • Lutz Marquardt
  • Maurizio Omologo

Abstract

This paper describes a multichannel acoustic data collection recorded under the European DICIT project, during Wizard of Oz (WOZ) experiments carried out at the FAU and FBK-irst laboratories. The application of interest in DICIT is a distant-talking interface for the control of an interactive TV, operating in a typical living room with several interfering devices. The objective of the experiments was to collect a database supporting the efficient development and tuning of acoustic processing algorithms for signal enhancement. In DICIT, techniques for sound source localization, multichannel acoustic echo cancellation, blind source separation, speech activity detection, speaker identification and verification, as well as beamforming, are combined to achieve the largest possible reduction of the impairments of user speech that are typical of distant-talking interfaces. The collected database made it possible to simulate a realistic scenario at a preliminary stage and to tailor the algorithms involved to the observed user behavior. To match the project requirements, the WOZ experiments were recorded in three languages: English, German and Italian. Besides the user inputs, the database also contains non-speech acoustic events, room impulse response measurements and video data, the latter used to compute the three-dimensional position of each subject. Sessions were manually transcribed and segmented at the word level, with specific labels introduced for acoustic events.
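
The article itself contains no code; purely to illustrate the kind of front-end processing such multichannel recordings are meant to support, the sketch below shows time-delay estimation with GCC-PHAT, a common building block of sound source localization. It is a minimal Python/NumPy example under assumed settings (a hypothetical gcc_phat function, a 16 kHz sampling rate, an interpolation factor of 4) and is not taken from the DICIT system.

    import numpy as np

    def gcc_phat(sig, ref, fs=16000, max_tau=None, interp=4):
        """Estimate the time delay of arrival between two microphone channels
        using the generalized cross-correlation with PHAT weighting."""
        # Zero-pad so the circular cross-correlation approximates the linear one.
        n = sig.shape[0] + ref.shape[0]
        SIG = np.fft.rfft(sig, n=n)
        REF = np.fft.rfft(ref, n=n)
        R = SIG * np.conj(REF)
        R /= np.abs(R) + 1e-15            # PHAT weighting: keep only phase information
        cc = np.fft.irfft(R, n=interp * n)
        max_shift = interp * n // 2
        if max_tau is not None:
            max_shift = min(int(interp * fs * max_tau), max_shift)
        # Re-center the correlation so the middle index corresponds to zero delay.
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        shift = np.argmax(np.abs(cc)) - max_shift
        return shift / float(interp * fs)  # delay in seconds

    if __name__ == "__main__":
        # Synthetic check: white noise delayed by 10 samples (0.625 ms at 16 kHz).
        fs = 16000
        rng = np.random.default_rng(0)
        x = rng.standard_normal(fs)
        y = np.concatenate((np.zeros(10), x[:-10]))
        print(gcc_phat(y, x, fs))          # expected: about 10 / fs = 6.25e-4 s

In a microphone-array setting, delays estimated this way between pairs of channels can be combined to infer the speaker's direction of arrival; the DICIT corpus, with its synchronized multichannel audio and 3D position labels derived from video, is the kind of data against which such estimates can be evaluated.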

Keywords

Multimodal · Corpus annotation · Audio


Acknowledgments

This work was partially funded by the Commission of the European Community, Information Society Technologies (IST), under the FP6 DICIT project (IST-034624).


Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  • Alessio Brutti (1)
  • Luca Cristoforetti (1)
  • Walter Kellermann (2)
  • Lutz Marquardt (2)
  • Maurizio Omologo (1)

  1. Fondazione Bruno Kessler (FBK)-irst, Povo (TN), Italy
  2. Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg (FAU), Erlangen, Germany
