Abstract
In this paper, we present a framework for evaluating the human corrections applied to the output of a speaker diarization system. We define four elementary actions to correct the diarization (“Create a boundary”, “Delete a boundary”, “Create a speaker label” and “Change the speaker label”) and an automaton that simulates the correction sequence. A metric is described to evaluate the correction cost. The framework is evaluated on French broadcast news drawn from the REPERE, ESTER and ETAPE campaigns.
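As an illustration only, the four elementary actions and a correction cost can be sketched as follows. This is not the automaton or the metric of the paper: the unit cost per action, the exact-match comparison of boundary times, and the names `Action`, `UNIT_COSTS`, `boundary_actions` and `correction_cost` are assumptions introduced for this sketch.

```python
from collections import Counter
from enum import Enum


class Action(Enum):
    """The four elementary correction actions named in the abstract."""
    CREATE_BOUNDARY = "Create a boundary"
    DELETE_BOUNDARY = "Delete a boundary"
    CREATE_SPEAKER_LABEL = "Create a speaker label"
    CHANGE_SPEAKER_LABEL = "Change the speaker label"


# Assumption: every action costs one unit. The paper's metric may weight them differently.
UNIT_COSTS = {action: 1.0 for action in Action}


def boundary_actions(hyp_boundaries, ref_boundaries):
    """List the boundary actions that turn the hypothesis boundaries into the reference.

    Assumption: boundaries are compared by exact time value, with no tolerance.
    """
    hyp, ref = set(hyp_boundaries), set(ref_boundaries)
    actions = [Action.CREATE_BOUNDARY] * len(ref - hyp)   # boundaries missing from the hypothesis
    actions += [Action.DELETE_BOUNDARY] * len(hyp - ref)  # spurious boundaries in the hypothesis
    return actions


def correction_cost(actions, costs=UNIT_COSTS):
    """Sum the per-action costs over a correction sequence."""
    counts = Counter(actions)
    return sum(costs[action] * n for action, n in counts.items())


# Hypothetical example: boundary times in seconds.
actions = boundary_actions([0.0, 5.0, 9.0], [0.0, 4.0, 9.0, 12.0])
actions.append(Action.CHANGE_SPEAKER_LABEL)  # one relabelled segment
total = correction_cost(actions)
```

With these toy values, two boundaries are created, one is deleted and one label is changed, so the unit-cost total is 4.0.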
Notes
The modified version of the Transcriber code is available at https://git-lium.univ-lemans.fr/broux/transcriber-log.
About this article
Cite this article
Broux, PA., Petitrenaud, S., Meignier, S. et al. Evaluating human corrections in a computer-assisted speaker diarization system. Lang Resources & Evaluation 55, 151–172 (2021). https://doi.org/10.1007/s10579-020-09493-6
DOI: https://doi.org/10.1007/s10579-020-09493-6