The 2006 Athens Information Technology Speech Activity Detection and Speaker Diarization Systems

  • Elias Rentzeperis
  • Andreas Stergiou
  • Christos Boukis
  • Aristodemos Pnevmatikakis
  • Lazaros C. Polymenakos
Lecture Notes in Computer Science (LNCS), volume 4299


This paper describes the Speech Activity Detection (SAD) and Speaker Diarization (SPKR) systems developed by Athens Information Technology within the scope of the NIST RT-06S evaluations. The SAD system classifies recorded frames as speech or non-speech using Linear Discriminant Analysis (LDA), while the SPKR system first segments recordings into speech intervals based on the Bayesian Information Criterion (BIC) and then applies a two-step clustering strategy to group together segments from the same speaker. Following a discussion of the inner workings of the two systems, we report and comment on our results on the RT-06S corpus [20].
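
To make the BIC-based segmentation step concrete, the sketch below scores a hypothesised speaker change point inside a window of acoustic feature vectors and accepts the point only when splitting the window into two Gaussian models beats a single-model fit. This is a minimal illustrative Python sketch, not the system evaluated in the paper: the feature choice (e.g. MFCCs), the penalty weight `lam`, the minimum segment length `min_seg`, and the helper names `delta_bic` and `find_change_point` are assumptions made for the example.

```python
import numpy as np

def delta_bic(X, i, lam=1.0):
    # Delta-BIC for a hypothesised change at frame i of the window X (shape N x d).
    # Positive values favour modelling X[:i] and X[i:] with two full-covariance
    # Gaussians rather than a single Gaussian for the whole window.
    N, d = X.shape
    logdet = lambda S: np.linalg.slogdet(np.cov(S, rowvar=False))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(N)
    return 0.5 * (N * logdet(X) - i * logdet(X[:i]) - (N - i) * logdet(X[i:])) - penalty

def find_change_point(X, min_seg=50, lam=1.0):
    # Scan candidate positions (keeping min_seg frames on each side) and return
    # the best-scoring change point, or None if no candidate has a positive score.
    N = X.shape[0]
    candidates = range(min_seg, N - min_seg)
    if len(candidates) == 0:
        return None
    scores = [delta_bic(X, i, lam) for i in candidates]
    best = int(np.argmax(scores))
    return candidates[best] if scores[best] > 0 else None

# Toy usage: 13-dimensional features with a statistics change half-way through.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 13)), rng.normal(1.5, 2.0, (200, 13))])
print(find_change_point(X))  # expected to land near frame 200
```

In a full diarization front end this scan would typically be repeated over a growing or sliding window, and the resulting segments would then be clustered; the two-step clustering used by the SPKR system is described in the paper itself.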


Keywords: Linear Discriminant Analysis, Bayesian Information Criterion, Gaussian Mixture Model, Voice Activity Detection, Smart Space


References


  1. Weiser, M.: The Computer for the 21st Century. Scientific American 265(3), 66–75 (1991)
  2. Waibel, A., Steusloff, H., Stiefelhagen, R., et al.: CHIL: Computers in the Human Interaction Loop. In: 5th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Lisbon, Portugal (April 2004)
  3. Pnevmatikakis, A., Talantzis, F., Soldatos, J., Polymenakos, L.: Robust Multimodal Audio-Visual Processing for Advanced Context Awareness in Smart Spaces. In: Maglogiannis, I., Karpouzis, K., Bramer, M. (eds.) Artificial Intelligence Applications and Innovations (AIAI 2006), pp. 290–301. Springer, Heidelberg (2006)
  4.
  5. Katsarakis, N., Souretis, G., Talantzis, F., Pnevmatikakis, A., Polymenakos, L.: 3D Audiovisual Person Tracking Using Kalman Filtering and Information Theory. In: Stiefelhagen, R., Garofolo, J.S. (eds.) CLEAR 2006. LNCS, vol. 4122, pp. 45–54. Springer, Heidelberg (2007)
  6. Stergiou, A., Pnevmatikakis, A., Polymenakos, L.: A Decision Fusion System across Time and Classifiers for Audio-visual Person Identification. In: Stiefelhagen, R., Garofolo, J.S. (eds.) CLEAR 2006. LNCS, vol. 4122. Springer, Heidelberg (2007)
  7. Stergiou, A., Pnevmatikakis, A., Polymenakos, L.: Enhancing the Performance of a GMM-based Speaker Identification System in a Multi-Microphone Setup. In: INTERSPEECH 2006, Pittsburgh (accepted, September 2006)
  8. Rabiner, L.R., Sambur, M.R.: An Algorithm for Determining the Endpoints of Isolated Utterances. The Bell System Technical Journal 54, 297 (1975)
  9. Li, K., Swamy, M.N.S., Ahmad, M.O.: An Improved Voice Activity Detection Using Higher Order Statistics. IEEE Transactions on Speech and Audio Processing 13(5) (September 2005)
  10. Stegmann, J., Schroeder, G.: Robust Voice Activity Detection Based on the Wavelet Transform. In: Proc. IEEE Workshop on Speech Coding for Telecommunications, Pocono Manor, Pennsylvania, USA, pp. 99–100 (September 1997)
  11. Reynolds, D.A., Rose, R.C., Smith, M.J.T.: PC-Based TMS320C30 Implementation of the Gaussian Mixture Model Text-Independent Speaker Recognition System. In: International Conference on Signal Processing Applications and Technology, Cambridge, Massachusetts, pp. 967–973 (November 1992)
  12. Martin, A., Charlet, D., Mauuary, L.: Robust Speech/Non-Speech Detection Using LDA Applied to MFCC. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2001), Salt Lake City (2001)
  13. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience, New York (2001)
  14. Rabiner, L., Schafer, R.: Digital Processing of Speech Signals. Prentice Hall Series in Signal Processing (1978)
  15. Wu, T.-Y., Lu, L., Chen, K., Zhang, H.-J.: Universal Background Models for Real-Time Speaker Change Detection. In: MMM 2003, pp. 135–149 (2003)
  16. Moraru, D., Meignier, S., Fredouille, C., Besacier, L., Bonastre, J.-F.: The ELISA Consortium Approaches in Broadcast News Speaker Segmentation during the NIST 2003 Rich Transcription Evaluation. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Montreal, Canada (2004)
  17. Gauvain, J.L., Lamel, L., Adda, G.: Partitioning and Transcription of Broadcast News Data. In: International Conference on Speech and Language Processing, Sydney, Australia, vol. 4, pp. 1335–1338 (December 1998)
  18. Tritschler, A., Gopinath, R.: Improved Speaker Segmentation and Segments Clustering Using the Bayesian Information Criterion. In: Proc. Eurospeech, pp. 679–682 (1999)
  19. Reynolds, D.A., Rose, R.C.: Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models. IEEE Transactions on Speech and Audio Processing 3(1), 72–83 (1995)
  20. Fiscus, J.: Spring 2006 (RT-06S) Rich Transcription Meeting Recognition Evaluation Plan (v2) (2006)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Elias Rentzeperis
  • Andreas Stergiou
  • Christos Boukis
  • Aristodemos Pnevmatikakis
  • Lazaros C. Polymenakos

  1. Autonomic & Grid Computing Group, Athens Information Technology, Athens, Greece
