Abstract
We present a new application of spiking neurons: audio-visual speech recognition. The features extracted from the audio (cepstral coefficients) and from the video (mouth height and width, and the percentages of black and white pixels inside the mouth) are simple enough to allow real-time integration of the complete system. A generic preprocessing step converts these features into a spike sequence that is processed by the neural network performing the classification. Training is done in one pass: the user pronounces each word of the dictionary once. Tests on the European M2VTS database show the value of such a system for audio-visual speech recognition. In the presence of noise in particular, audio-visual recognition performs much better than recognition based on the audio modality alone.
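The abstract describes a pipeline in which per-frame feature values are converted into a spike sequence before classification. The details of the paper's generic preprocessing are not given in the abstract; the following is a minimal illustrative sketch, assuming a simple latency code in which larger normalized feature values fire earlier. The feature values and the function `features_to_spikes` are hypothetical, not taken from the paper.

```python
import numpy as np

def features_to_spikes(features, t_max=100.0):
    """Convert a feature vector into one spike time per input neuron
    using latency coding: larger values fire earlier.

    Illustrative assumption only; the paper's actual preprocessing
    for its spatio-temporal network is not detailed in the abstract.
    """
    f = np.asarray(features, dtype=float)
    # Normalize the feature vector to [0, 1].
    f = (f - f.min()) / (f.max() - f.min() + 1e-12)
    # Latency code: the strongest feature spikes at t = 0,
    # the weakest at t = t_max.
    return (1.0 - f) * t_max

# Hypothetical per-frame features: audio (cepstral coefficients)
# concatenated with video (mouth geometry and pixel ratios).
audio = [12.3, -4.1, 0.8]
video = [0.42, 0.31, 0.65, 0.35]
spikes = features_to_spikes(audio + video)
print(spikes.shape)  # (7,)
```

Under this coding, each input neuron emits exactly one spike per frame, so a word becomes a spatio-temporal spike pattern that a one-pass learner can store directly.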
© 2002 Springer-Verlag Berlin Heidelberg
Séguier, R., Mercier, D. (2002). Audio-Visual Speech Recognition One Pass Learning with Spiking Neurons. In: Dorronsoro, J.R. (eds) Artificial Neural Networks — ICANN 2002. ICANN 2002. Lecture Notes in Computer Science, vol 2415. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46084-5_195
Print ISBN: 978-3-540-44074-1
Online ISBN: 978-3-540-46084-8