The Persian Linguistic Based Audio-Visual Data Corpus, AVA II, Considering Coarticulation

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNCS, volume 5916)

Abstract

Collecting an audio-visual data corpus based on linguistic rules is an essential first step for major research in multimedia fields such as audio-visual speech recognition (AVSR), lip synchronization, and visual speech synthesis. Building a reliable data corpus that covers all phonemes of a language in all their phonemic combinations is a difficult and time-consuming task. To partially address this problem, this research uses VC, CV, and VCV combinations, which carry most of the linguistic information, instead of the entire set of possible phonemic combinations. This paper describes the new data corpus, which contains recordings of 14 respondents. To better capture the coarticulation effect in speech, continuous speech was considered rather than only isolated and continuous digits. This makes the collection process more time- and cost-efficient while keeping efficiency high.
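The VC/CV/VCV reduction described above can be sketched as a simple enumeration. The phoneme inventories below are illustrative placeholders only, not the paper's actual Persian phoneme sets (Persian has 6 vowels and roughly 23 consonants):

```python
from itertools import product

# Hypothetical, simplified inventories for illustration only.
vowels = ["a", "e", "o", "i", "u", "A"]       # "A" stands in for long /ɑ/ (assumption)
consonants = ["b", "p", "t", "d", "k", "g"]   # small subset for the sketch

def target_units(vowels, consonants):
    """Enumerate the VC, CV, and VCV units a corpus of this design targets."""
    vc = ["".join(p) for p in product(vowels, consonants)]
    cv = ["".join(p) for p in product(consonants, vowels)]
    vcv = ["".join(p) for p in product(vowels, consonants, vowels)]
    return vc, cv, vcv

vc, cv, vcv = target_units(vowels, consonants)
# With 6 vowels and 6 consonants this yields 36 VC, 36 CV, and 216 VCV
# items, far fewer than exhaustive coverage of all phoneme sequences.
print(len(vc), len(cv), len(vcv))  # → 36 36 216
```

The point of the sketch is the scaling: the unit count grows as |V|·|C| and |V|·|C|·|V| rather than exponentially in utterance length, which is what makes recording feasible while still exposing coarticulation contexts.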

Keywords

  • Audio-visual database design
  • Linguistic approach
  • Coarticulation
  • Persian data corpus
  • Multimedia modeling
  • Farsi audio-visual data corpus
  • AVA II




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bastanfard, A., Fazel, M., Kelishami, A.A., Aghaahmadi, M. (2010). The Persian Linguistic Based Audio-Visual Data Corpus, AVA II, Considering Coarticulation. In: Boll, S., Tian, Q., Zhang, L., Zhang, Z., Chen, YP.P. (eds) Advances in Multimedia Modeling. MMM 2010. Lecture Notes in Computer Science, vol 5916. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11301-7_30

  • DOI: https://doi.org/10.1007/978-3-642-11301-7_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-11300-0

  • Online ISBN: 978-3-642-11301-7

  • eBook Packages: Computer Science (R0)