Bottleneck Based Front-End for Diarization Systems

  • Ignacio Viñals
  • Jesús Villalba
  • Alfonso Ortega
  • Antonio Miguel
  • Eduardo Lleida
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10077)


The goal of this paper is to study the inclusion of deep learning in the diarization task. We propose novel approaches at the feature extraction stage, replacing classical short-term features, such as MFCCs and PLPs, with deep-learning-based ones. These new features come from the hidden states at bottleneck layers of neural networks trained for ASR tasks.
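As a rough illustration of what taking features "from the hidden states at bottleneck layers" can mean, the following minimal sketch forwards each input frame through a small feed-forward network and stops at a narrow layer. Everything in it is an assumption made for illustration: the layer sizes, the random weights (standing in for a network actually trained on an ASR task), the sigmoid nonlinearity, and the choice of the linear pre-activation output as the feature. It is not the architecture used in the paper.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights stand in for a network trained on an ASR task (assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def affine(layer, x):
    """Compute W x + b for one frame."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def bottleneck_features(frames, layers, bottleneck_index):
    """Forward each frame up to the bottleneck layer and return its linear output."""
    features = []
    for frame in frames:
        h = frame
        for i, layer in enumerate(layers):
            h = affine(layer, h)
            if i == bottleneck_index:
                break  # layers after the bottleneck are only needed for ASR training
            h = sigmoid(h)
        features.append(h)
    return features

rng = random.Random(0)
# Hypothetical sizes: 20-dim input, 64 hidden, 8-dim bottleneck, 64 hidden, 30 outputs.
sizes = [20, 64, 8, 64, 30]
layers = [make_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
frames = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(5)]
bn = bottleneck_features(frames, layers, bottleneck_index=1)
print(len(bn), len(bn[0]))  # 5 8
```

The resulting 8-dimensional vectors would then play the role that MFCC or PLP frames play in a classical front-end.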

These new features will be included in the University of Zaragoza ViVoLAB speaker diarization system, designed for the Multi-Genre Broadcast (MGB) challenge of the 2015 ASRU Workshop. This system, designed following the i-vector paradigm, uses the input features to segment the input audio and construct one i-vector per segment. These i-vectors will be clustered into speakers according to generative PLDA models.
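The segment-then-cluster flow can be sketched with deliberately simplified stand-ins. In this toy example, fixed-length segments replace the system's segmentation step, the per-segment mean vector replaces the i-vector, and cosine-similarity agglomerative clustering replaces the generative PLDA clustering. All function names and the threshold are hypothetical, and none of this reproduces the ViVoLAB system.

```python
import math

def segment_means(frames, seg_len):
    """One embedding per fixed-length segment: the mean frame (stand-in for an i-vector)."""
    embeddings = []
    for start in range(0, len(frames), seg_len):
        seg = frames[start:start + seg_len]
        dim = len(seg[0])
        embeddings.append([sum(f[d] for f in seg) / len(seg) for d in range(dim)])
    return embeddings

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def centroid(cluster, embeddings):
    dim = len(embeddings[0])
    return [sum(embeddings[i][d] for i in cluster) / len(cluster) for d in range(dim)]

def cluster_segments(embeddings, threshold):
    """Merge the most similar pair of clusters until no pair reaches the threshold."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = cosine(centroid(clusters[a], embeddings),
                               centroid(clusters[b], embeddings))
                if best is None or score > best[0]:
                    best = (score, a, b)
        if best[0] < threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Turn the cluster membership lists into one speaker label per segment.
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy input: two "speakers" with clearly different 2-dim features, 4 frames per turn.
speaker_a = [[1.0, 0.1]] * 4
speaker_b = [[0.1, 1.0]] * 4
frames = speaker_a + speaker_b + speaker_a
labels = cluster_segments(segment_means(frames, seg_len=4), threshold=0.9)
print(labels)  # [0, 1, 0]
```

The first and third segments receive the same label because their embeddings are nearly identical, while the second falls below the similarity threshold and stays in its own cluster.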

The evaluation of our new approach is carried out on broadcast audio from the 2015 MGB Challenge.


Keywords: Diarization, Deep Neural Networks, Bottlenecks



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Ignacio Viñals (1)
  • Jesús Villalba (2)
  • Alfonso Ortega (1)
  • Antonio Miguel (1)
  • Eduardo Lleida (1)

  1. ViVoLAB, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain
  2. Cirrus Logic, Madrid, Spain
