Application of New Qualitative Voicing Time-Frequency Features for Speaker Recognition

  • Nidhal Ben Aloui
  • Hervé Glotin
  • Patrick Hebrard
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4642)


This paper presents original and efficient Qualitative Time-Frequency (QTF) speech features for speaker recognition, based on a qualitative representation of mid-term speech dynamics. For each frame of around 150 ms, we estimate and binarize the voicing activity in 6 frequency subbands. We then derive the graph of Allen temporal relations between the 6 resulting time intervals. This set of temporal relations, estimated at each frame, feeds a neural network trained for speaker recognition. Experiments are conducted on fifty speakers (male and female) from the ESTER reference radio database (40 hours) of continuous speech. Our best model yields a frame classification error of around 3%, without using frame continuity information, which is comparable to the state of the art. Moreover, QTF provides a simple, lightweight representation that codes speaker identity with only 15 integers.
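As an illustration of why 6 subband intervals lead to 15 integers, note that 6 intervals form 6 × 5 / 2 = 15 unordered pairs, and each pair can be labelled with one of Allen's 13 atomic interval relations. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each subband contributes a single voiced interval `(start, end)` per frame, and the function names (`allen_relation`, `qtf_features`) and the integer coding of the relations are hypothetical choices made for this example.

```python
from itertools import combinations

# Allen's 13 atomic interval relations. The integer code of each relation is
# simply its index in this list (an assumed coding, not taken from the paper).
ALLEN_RELATIONS = [
    "before", "meets", "overlaps", "starts", "during", "finishes", "equals",
    "finished-by", "contains", "started-by", "overlapped-by", "met-by", "after",
]


def allen_relation(a, b):
    """Return the Allen relation of interval a with respect to interval b.

    Each interval is a (start, end) pair with start < end.
    """
    (a0, a1), (b0, b1) = a, b
    if a1 < b0:
        return "before"
    if a1 == b0:
        return "meets"
    if b1 < a0:
        return "after"
    if b1 == a0:
        return "met-by"
    # From here on the two intervals share more than a single point.
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    return "overlaps" if a0 < b0 else "overlapped-by"


def qtf_features(intervals):
    """Code every unordered pair of intervals as one integer.

    With 6 subband intervals this yields 15 integers per frame.
    """
    return [
        ALLEN_RELATIONS.index(allen_relation(a, b))
        for a, b in combinations(intervals, 2)
    ]


if __name__ == "__main__":
    # Hypothetical voiced intervals (in ms) for 6 subbands within one 150 ms frame.
    frame = [(0, 60), (10, 50), (60, 120), (0, 60), (30, 150), (100, 150)]
    features = qtf_features(frame)
    print(len(features), features)
```

In a setup like the one the abstract describes, a vector of this kind would be computed for each frame and passed to the classifier. How the voiced interval of each subband is extracted from the binarized voicing estimate is not specified here and is left outside the sketch.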


Keywords (machine-generated, not supplied by the authors): Speaker Recognition; Speech Feature; Evaluation Campaign; Ester Phase; Atomic Relation



Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Nidhal Ben Aloui (1, 2)
  • Hervé Glotin (1)
  • Patrick Hebrard (2)
  1. Université du Sud Toulon-Var, Laboratoire LSIS, B.P. 20 132 - 83 957 La Garde, France
  2. DCNS - Division SIS, Le Mourillon, B.P. 403 - 83 055 Toulon, France. Email: nidhal.ben-aloui@dcn.fr
