Utilising Tree-Based Ensemble Learning for Speaker Segmentation

Abou-Zleikha, Mohamed; Tan, Zheng-Hua; Christensen, Mads Græsbøll; Jensen, Søren Holdt

doi:10.1007/978-3-662-44654-6_5

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 436)

Included in the following conference series:

IFIP International Conference on Artificial Intelligence Applications and Innovations

Abstract

In audio and speech processing, accurately detecting the change points between speakers in a speech segment is an important stage for applications such as speaker identification and tracking. Approaches based on the Bayesian Information Criterion (BIC) are the most widely used because they have proved very effective for this task. The main criticism levelled against BIC-based approaches is the use of a penalty parameter in the BIC function. This parameter must be fine-tuned for each variation in acoustic conditions. Once tuned for a particular condition, the model becomes biased towards the tuning data, which limits its ability to generalise.
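
For orientation, the penalty parameter in question is the weight λ in the ΔBIC criterion as it is commonly written for full-covariance Gaussian models. The notation below is an assumption made for illustration and may differ from the notation used in the chapter: N is the number of d-dimensional feature frames in the analysis window, i is the candidate change point, Σ is the sample covariance of the whole window, and Σ₁ and Σ₂ are the sample covariances of the frames before and after i.

$$
\Delta\mathrm{BIC}(i) = \frac{N}{2}\log\lvert\Sigma\rvert - \frac{i}{2}\log\lvert\Sigma_1\rvert - \frac{N-i}{2}\log\lvert\Sigma_2\rvert - \lambda P,
\qquad
P = \frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log N
$$

A positive ΔBIC(i) favours a speaker change at i. The nominal value of λ is 1, but in practice it is usually adjusted to the acoustic conditions, which is the tuning step the criticism refers to.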

In this paper, we propose a BIC-based, tuning-free approach to speaker segmentation that uses ensemble-based learning. A forest of segmentation trees is constructed, in which each tree is trained on a sampled version of the speech segment. During tree construction, a set of randomly selected points in the input sequence is examined as potential segmentation points. The point that yields the highest ΔBIC is chosen, and the same process is repeated for the resulting left and right segments. Each node of the tree thus stores the highest ΔBIC found together with the index of the associated point. Once the forest is built, the ΔBIC values are accumulated for each point across all trees, and the positions of the local maxima are taken as speaker change points. The proposed approach is tested on artificially created conversations from the TIMIT database. It shows very accurate results, comparable to those achieved by state-of-the-art methods, with a 9% (absolute) higher F₁ than the standard ΔBIC with an optimally tuned penalty parameter.
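
The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the sampling scheme, the stopping rule, the handling of the penalty term, or the peak-picking details, so all of these are assumptions here. Specifically, each tree sees a random subset of frames kept in temporal order, a node stops splitting when its best ΔBIC is not positive, the penalty weight is left at its nominal value of 1, and the accumulated scores are smoothed with a moving average before local maxima are picked. The function names (`delta_bic`, `grow`, `forest_scores`, `change_points`) and all default parameter values are hypothetical.

```python
import numpy as np


def delta_bic(X, t, lam=1.0):
    """Delta-BIC for splitting frames X (n x d) at index t, full-covariance Gaussians."""
    n, d = X.shape

    def logdet(Z):
        # A small ridge keeps the covariance invertible on short segments.
        cov = np.atleast_2d(np.cov(Z, rowvar=False, bias=True)) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    gain = 0.5 * (n * logdet(X) - t * logdet(X[:t]) - (n - t) * logdet(X[t:]))
    return gain - lam * penalty


def grow(X, lo, hi, scores, rng, n_cand, min_len):
    """Recursively split X[lo:hi], adding each node's best Delta-BIC to scores."""
    if hi - lo < 2 * min_len:
        return
    # Randomly selected candidate points, as in the abstract.
    cands = rng.integers(lo + min_len, hi - min_len + 1, size=n_cand)
    vals = [delta_bic(X[lo:hi], int(c) - lo) for c in cands]
    best = int(np.argmax(vals))
    if vals[best] <= 0:  # Assumed stopping rule: no evidence of a change.
        return
    c = int(cands[best])
    scores[c] += vals[best]
    grow(X, lo, c, scores, rng, n_cand, min_len)
    grow(X, c, hi, scores, rng, n_cand, min_len)


def forest_scores(X, n_trees=30, frac=0.8, n_cand=10, min_len=20, seed=0):
    """Accumulate Delta-BIC per frame index over a forest of segmentation trees."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = np.zeros(n)
    for _ in range(n_trees):
        # Assumed sampling: a random subset of frames, kept in temporal order.
        idx = np.sort(rng.choice(n, size=int(frac * n), replace=False))
        local = np.zeros(len(idx))
        grow(X[idx], 0, len(idx), local, rng, n_cand, min_len)
        scores[idx] += local  # Map tree-local indices back to original frames.
    return scores


def change_points(scores, win=25):
    """Local maxima of the moving-average-smoothed accumulated scores."""
    smooth = np.convolve(scores, np.ones(win) / win, mode="same")
    return [
        i
        for i in range(1, len(smooth) - 1)
        if smooth[i] > 0 and smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]
    ]
```

Given a feature matrix `X` with one row per frame (for example MFCC vectors), `change_points(forest_scores(X))` returns candidate change-point indices in frames. Because the moving average can leave several nearby local maxima around a single change, a practical system would likely merge peaks that fall within a minimum distance of each other.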

Author information

Authors and Affiliations

Department of Electronic Systems, Aalborg University, Denmark
Mohamed Abou-Zleikha, Zheng-Hua Tan & Søren Holdt Jensen
Audio Analysis Lab, ad:mt, Aalborg University, Denmark
Mads Græsbøll Christensen

Editor information

Editors and Affiliations

Department of Forestry and Management of the Environment, Democritus University of Thrace, Pandazidou 193, 68200, Orestiada, Greece
Lazaros Iliadis
Department of Digital Systems, University of Piraeus, 80, Karaoli and Dimitriou Str., 18534, Piraeus, Greece
Ilias Maglogiannis
Department of Computer Science and Engineering, Frederick University, 7 Yianni Frederickou Str., Pallouriotissa, 1036, Nicosia, Cyprus
Harris Papadopoulos

About this paper

Cite this paper

Abou-Zleikha, M., Tan, ZH., Christensen, M.G., Jensen, S.H. (2014). Utilising Tree-Based Ensemble Learning for Speaker Segmentation. In: Iliadis, L., Maglogiannis, I., Papadopoulos, H. (eds) Artificial Intelligence Applications and Innovations. AIAI 2014. IFIP Advances in Information and Communication Technology, vol 436. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44654-6_5

DOI: https://doi.org/10.1007/978-3-662-44654-6_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44653-9
Online ISBN: 978-3-662-44654-6
eBook Packages: Computer Science, Computer Science (R0)