Shot Classification and Keyframe Detection for Vision Based Speakers Diarization in Parliamentary Debates

  • Pedro A. Marín-Reyes
  • Javier Lorenzo-Navarro
  • Modesto Castrillón-Santana
  • Elena Sánchez-Nielsen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9868)

Abstract

Automatic labelling of speakers is an essential task for speaker diarization in parliamentary debates, given the huge amount of video data to annotate. In this paper, we address the speaker diarization problem as a visual speaker re-identification problem, with special emphasis on the analysis of different shot types. We propose two approaches that make use of convolutional neural networks (CNNs) and biometric traits for keyframe extraction. The approaches were evaluated on challenging real-world datasets from the Canary Islands Parliament and compared with a similar approach that does not analyze the shot type. Results show that using CNNs for shot classification together with biometric traits improves re-identification performance by 9.8% on average.

Keywords

Visual diarization; Re-identification; CNN classification; Biometric traits

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Pedro A. Marín-Reyes (1)
  • Javier Lorenzo-Navarro (1)
  • Modesto Castrillón-Santana (1)
  • Elena Sánchez-Nielsen (2)

  1. Instituto Universitario SIANI, Universidad de Las Palmas de Gran Canaria, Las Palmas, Spain
  2. Departamento de Ingeniería Informática y de Sistemas, Universidad de la Laguna, Santa Cruz de Tenerife, Spain
