Abstract
Speaker diarization aims to automatically answer the question “who spoke when” given a speech signal. In this work, we apply the FLsD approach, a semi-supervised version of Fisher Linear Discriminant analysis, to both the audio and the video signals to form a complete multimodal speaker diarization system. Extensive experiments show that the FLsD method boosts the performance of the face diarization task (i.e., the task of discovering faces over time given only the visual signal). Our experiments also show that the ability of the FLsD method to discriminate between faces is independent of the initial feature space and remains relatively unaffected as the number of faces increases. Finally, we propose a fusion method that improves performance over the best individual modality, which is the audio signal.
Cite this article
Sarafianos, N., Giannakopoulos, T. & Petridis, S. Audio-visual speaker diarization using fisher linear semi-discriminant analysis. Multimed Tools Appl 75, 115–130 (2016). https://doi.org/10.1007/s11042-014-2274-x
DOI: https://doi.org/10.1007/s11042-014-2274-x