On classification and segmentation of massive audio data streams

Aggarwal, Charu C.

doi:10.1007/s10115-008-0174-y

Regular Paper
Published: 16 October 2008

Volume 20, pages 137–156, (2009)

Knowledge and Information Systems

Charu C. Aggarwal

Abstract

In recent years, the proliferation of VoIP data has created a number of applications in which it is desirable to perform quick online classification and recognition of massive voice streams. Such applications typically arise in real-time intelligence and surveillance. In many cases, the data streams are in compressed format, and the data often has to be processed at rates of gigabits per second. All known techniques for speaker voice analysis require an offline training phase in which the system is trained with known segments of speech. The state-of-the-art method for text-independent speaker recognition is Gaussian mixture modeling (GMM), which requires an iterative expectation-maximization procedure for training and therefore cannot be implemented in real time. In many real applications (such as surveillance), it is desirable to perform recognition online, so that the system can be quickly adapted to new segments of the data. It may also be desirable to quickly create databases of training profiles for speakers of interest. In this paper, we discuss the details of such an online voice recognition system. For this purpose, we use our micro-clustering algorithms to design concise signatures of the target speakers. One surprising and insightful observation from our experience with this system is that, although it was originally designed only for efficiency, it also turned out to be more accurate than the widely used GMM. This is because the conciseness of the micro-cluster model makes it less prone to overtraining. This suggests that it is often possible to get the best of both worlds and outperform complex models in both efficiency and accuracy. We present experimental results illustrating the effectiveness and efficiency of the method.
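
The abstract does not spell out how the micro-cluster signatures are built, so the following is only a minimal Python sketch of the general idea, not the paper's algorithm. It assumes each speech frame has already been converted to a numeric feature vector (for example cepstral coefficients), summarizes each speaker with a bounded number of additive micro-clusters (count, linear sum, squared sum), and labels a new segment by the speaker whose centroids lie closest on average. The names `MicroCluster`, `SpeakerSignature`, `classify`, `max_clusters`, and `radius`, as well as the synthetic two-dimensional data, are all hypothetical choices made for illustration.

```python
import math
import random


class MicroCluster:
    """Additive summary of a set of points: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = [x * x for x in point]

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def centroid(self):
        return [s / self.n for s in self.ls]


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class SpeakerSignature:
    """One-pass signature: at most max_clusters micro-clusters per speaker."""

    def __init__(self, max_clusters=8, radius=1.0):
        self.max_clusters = max_clusters  # assumed memory budget
        self.radius = radius              # assumed absorption threshold
        self.clusters = []

    def update(self, point):
        # Absorb the point into the nearest micro-cluster if it is close enough.
        if self.clusters:
            nearest = min(self.clusters, key=lambda c: dist(c.centroid(), point))
            if dist(nearest.centroid(), point) <= self.radius:
                nearest.add(point)
                return
        # Otherwise start a new micro-cluster and enforce the budget by merging.
        self.clusters.append(MicroCluster(point))
        if len(self.clusters) > self.max_clusters:
            self._merge_closest_pair()

    def _merge_closest_pair(self):
        k = len(self.clusters)
        pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
        i, j = min(pairs, key=lambda p: dist(self.clusters[p[0]].centroid(),
                                             self.clusters[p[1]].centroid()))
        self.clusters[i].merge(self.clusters.pop(j))

    def score(self, segment):
        """Mean distance from each frame to its nearest centroid (lower is better)."""
        total = sum(min(dist(c.centroid(), p) for c in self.clusters) for p in segment)
        return total / len(segment)


def classify(signatures, segment):
    """Return the speaker name whose signature best matches the segment."""
    return min(signatures, key=lambda name: signatures[name].score(segment))


# Synthetic stand-in for feature vectors of two speakers (illustration only).
random.seed(0)


def frames(center, k):
    return [[random.gauss(c, 0.5) for c in center] for _ in range(k)]


signatures = {"A": SpeakerSignature(), "B": SpeakerSignature()}
for p in frames([0.0, 0.0], 200):
    signatures["A"].update(p)
for p in frames([5.0, 5.0], 200):
    signatures["B"].update(p)

print(classify(signatures, frames([0.0, 0.0], 20)))
```

Because every update touches only a fixed number of summaries, training is a single pass over the stream with bounded memory, in contrast to the iterative expectation-maximization fitting that GMM training requires.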

Author information

Authors and Affiliations

IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY, 10532, USA
Charu C. Aggarwal

Corresponding author

Correspondence to Charu C. Aggarwal.

About this article

Cite this article

Aggarwal, C.C. On classification and segmentation of massive audio data streams. Knowl Inf Syst 20, 137–156 (2009). https://doi.org/10.1007/s10115-008-0174-y

Received: 01 December 2007
Revised: 21 May 2008
Accepted: 22 August 2008
Published: 16 October 2008
Issue Date: August 2009
DOI: https://doi.org/10.1007/s10115-008-0174-y
