Abstract
In audio and speech processing, accurate detection of the change points between speakers in a speech segment is an important stage for several applications, such as speaker identification and tracking. Approaches based on the Bayesian Information Criterion (BIC) are the most widely used, as they have proved very effective for this task. The main criticism levelled against BIC-based approaches is their reliance on a penalty parameter in the BIC function. This parameter must be fine-tuned for each variation in the acoustic conditions. When tuned for a certain condition, the model becomes biased towards the data used for training, which limits its generalisation ability.
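For context, the penalty parameter criticised above is usually written λ. The abstract does not spell out the criterion, so the following is the common full-covariance Gaussian form of ΔBIC rather than a formula taken from the paper. It assumes a window of N frames of d-dimensional features, split at a candidate point into parts of N_1 and N_2 frames, with sample covariances Σ, Σ_1 and Σ_2:

```latex
\Delta\mathrm{BIC} = \frac{N}{2}\log|\Sigma| - \frac{N_1}{2}\log|\Sigma_1| - \frac{N_2}{2}\log|\Sigma_2| - \lambda P,
\qquad
P = \frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log N
```

In this form a change is typically declared when ΔBIC is positive. The theoretical value is λ = 1, but in practice λ is adjusted per acoustic condition, which is the tuning burden described above.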
In this paper, we propose a tuning-free, BIC-based approach to speaker segmentation that uses ensemble learning. A forest of segmentation trees is constructed, in which each tree is trained on a sampled version of the speech segment. During tree construction, a set of randomly selected points in the input sequence is examined as potential segmentation points. The point that yields the highest ΔBIC is chosen, and the same process is repeated for the resulting left and right segments. Each node of the resulting tree stores the highest ΔBIC found for its segment together with the index of the corresponding point. Once the forest is built, the ΔBIC values are accumulated per point over all trees, and the positions of the local maxima are taken as speaker change points. The proposed approach is tested on artificially created conversations from the TIMIT database. It shows very accurate results, comparable to those achieved by state-of-the-art methods, with a 9% (absolute) higher F1 score than the standard ΔBIC with an optimally tuned penalty parameter.
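The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the sampling scheme, the number of candidate points, the stopping rule, or how the penalty is handled. The sketch assumes random frame subsampling, a fixed depth and minimum segment length, a zero penalty weight, and a moving-average smoothing step before picking local maxima. All function and parameter names are hypothetical.

```python
import numpy as np


def delta_bic(x, i, lam=0.0):
    """Delta-BIC for splitting x (frames x dims) at frame i, with one
    full-covariance Gaussian per side. lam=0 is an assumption made here."""
    n, d = x.shape

    def logdet(seg):
        cov = np.cov(seg, rowvar=False, bias=True).reshape(d, d)
        # Small ridge keeps the determinant finite on short segments.
        return np.linalg.slogdet(cov + 1e-6 * np.eye(d))[1]

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    gain = 0.5 * (n * logdet(x) - i * logdet(x[:i]) - (n - i) * logdet(x[i:]))
    return gain - lam * penalty


def grow_tree(x, idx, scores, rng, lo, hi, depth, n_cand=16, min_len=10):
    """Recursively split the sampled frames idx[lo:hi]. Each node adds its
    best Delta-BIC to scores at the original index of the chosen point."""
    if depth == 0 or hi - lo < 2 * min_len:
        return
    # Random candidate split points, at least min_len frames from each edge.
    cands = rng.integers(lo + min_len, hi - min_len + 1, size=n_cand)
    seg = x[idx[lo:hi]]
    vals = [delta_bic(seg, c - lo) for c in cands]
    best = int(np.argmax(vals))
    cut = int(cands[best])
    scores[idx[cut]] += vals[best]
    grow_tree(x, idx, scores, rng, lo, cut, depth - 1, n_cand, min_len)
    grow_tree(x, idx, scores, rng, cut, hi, depth - 1, n_cand, min_len)


def forest_scores(x, n_trees=30, frac=0.8, max_depth=4, seed=0):
    """Accumulate Delta-BIC per frame over trees grown on frame subsamples."""
    rng = np.random.default_rng(seed)
    n = len(x)
    scores = np.zeros(n)
    for _ in range(n_trees):
        # Sampled version of the segment: a sorted random subset of frames.
        idx = np.sort(rng.choice(n, size=int(frac * n), replace=False))
        grow_tree(x, idx, scores, rng, 0, len(idx), max_depth)
    return scores


def change_points(scores, width=11):
    """Smooth the accumulated scores (an assumption) and list local maxima."""
    smooth = np.convolve(scores, np.ones(width) / width, mode="same")
    peaks = [i for i in range(1, len(smooth) - 1)
             if smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]]
    return smooth, peaks
```

On synthetic features, for example two Gaussian blocks with different means stacked with `np.vstack`, calling `change_points(forest_scores(x))` returns the smoothed curve and the indices of its local maxima; the strongest maximum would be expected near the true boundary. Real use would replace the synthetic array with acoustic feature frames such as MFCCs, and would need some rule for discarding weak maxima, which the abstract does not describe.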
Copyright information
© 2014 IFIP International Federation for Information Processing
About this paper
Cite this paper
Abou-Zleikha, M., Tan, ZH., Christensen, M.G., Jensen, S.H. (2014). Utilising Tree-Based Ensemble Learning for Speaker Segmentation. In: Iliadis, L., Maglogiannis, I., Papadopoulos, H. (eds) Artificial Intelligence Applications and Innovations. AIAI 2014. IFIP Advances in Information and Communication Technology, vol 436. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44654-6_5
DOI: https://doi.org/10.1007/978-3-662-44654-6_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44653-9
Online ISBN: 978-3-662-44654-6
eBook Packages: Computer Science, Computer Science (R0)