Spoken Document Retrieval: Sub-sequence DTW Framework and Variants

Khatwani, Akshay; Pawar, Komala; Hegde, Sushma; Rao, Sudha; Seshasayee, Adithya; Ramasubramanian, V.

doi:10.1007/978-3-319-26832-3_29

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9468)

Included in the following conference series:

International Conference on Mining Intelligence and Knowledge Exploration

Abstract

We address the problem of spoken document retrieval (also termed content-based audio search and retrieval), which involves searching a large spoken document or database for a specific spoken query. We formulate the search within the sub-sequence DTW (SS-DTW) framework proposed earlier in the literature, adapted here to work on acoustic feature representations of the database and the spoken query term. Further, we propose several variants within this framework: (i) path-length-based score normalization, (ii) clustered quantization of acoustic feature vectors for fast search and retrieval with unchanged performance, and (iii) phonetic representation of the database and spoken query term, derived from ground-truth annotation as well as from HMM-based continuous phoneme recognition. We characterize the performance of the proposed framework, algorithms, and variants in terms of ROC curves, EER, and time complexity, and present results on the TIMIT database with annotated spoken sentences from 400 speakers.
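
To make the search formulation concrete, here is a minimal sketch of the standard sub-sequence DTW recurrence with a path-length-normalized score. It is not the authors' implementation: the function names (`euclidean`, `ss_dtw`), the Euclidean local distance, the three-step pattern, and the choice to normalize by the number of cells on the warping path are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Local distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ss_dtw(query, database, dist=euclidean):
    """Find the database segment that best matches the query.

    Returns (normalized_score, start_index, end_index), where the score
    is the accumulated cost divided by the warping-path length.
    """
    n, m = len(query), len(database)
    if n == 0 or m == 0:
        raise ValueError("query and database must be non-empty")

    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]   # accumulated cost
    length = [[0] * m for _ in range(n)]   # cells on the best path so far
    start = [[0] * m for _ in range(n)]    # database index where path began

    # Free start: the query may begin at any database frame.
    for j in range(m):
        cost[0][j] = dist(query[0], database[j])
        length[0][j] = 1
        start[0][j] = j

    for i in range(1, n):
        for j in range(m):
            # Predecessors: vertical, then diagonal and horizontal.
            cands = [(cost[i - 1][j], length[i - 1][j], start[i - 1][j])]
            if j > 0:
                cands.append((cost[i - 1][j - 1], length[i - 1][j - 1],
                              start[i - 1][j - 1]))
                cands.append((cost[i][j - 1], length[i][j - 1],
                              start[i][j - 1]))
            best = min(cands, key=lambda c: c[0])
            cost[i][j] = best[0] + dist(query[i], database[j])
            length[i][j] = best[1] + 1
            start[i][j] = best[2]

    # Free end: pick the end frame with the lowest normalized score.
    last = n - 1
    end = min(range(m), key=lambda j: cost[last][j] / length[last][j])
    return cost[last][end] / length[last][end], start[last][end], end


# Toy 1-D "features": the query occurs at database frames 2..4.
query = [[0.0], [1.0], [2.0]]
database = [[5.0], [5.0], [0.0], [1.0], [2.0], [5.0], [5.0]]
print(ss_dtw(query, database))  # (0.0, 2, 4)
```

The sketch runs in time proportional to the product of the query and database lengths. A retrieval decision would then compare the normalized score against a threshold, which is what the ROC and EER evaluation in the abstract sweeps over.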

A. Seshasayee: The author carried out this work as a Research Associate at PESIT-BSC, Bangalore.

Author information

Authors and Affiliations

PES Institute of Technology - Bangalore South Campus (PESIT-BSC), Bangalore, India
Akshay Khatwani, Komala Pawar, Sushma Hegde, Sudha Rao & V. Ramasubramanian
University of California, San Diego, USA
Adithya Seshasayee

Corresponding author

Correspondence to V. Ramasubramanian.

Editor information

Editors and Affiliations

Norwegian Univ. of Science & Technology, Trondheim, Norway
Rajendra Prasath
Intl Inst of Info Tech Hyderabad, Hyderabad, India
Anil Kumar Vuppala
V.H.N.S.N.College (Autonomous), Virudhunagar, Tamil Nadu, India
T. Kathirvalavakumar

About this paper

Cite this paper

Khatwani, A., Pawar, K., Hegde, S., Rao, S., Seshasayee, A., Ramasubramanian, V. (2015). Spoken Document Retrieval: Sub-sequence DTW Framework and Variants. In: Prasath, R., Vuppala, A., Kathirvalavakumar, T. (eds) Mining Intelligence and Knowledge Exploration. MIKE 2015. Lecture Notes in Computer Science, vol 9468. Springer, Cham. https://doi.org/10.1007/978-3-319-26832-3_29

DOI: https://doi.org/10.1007/978-3-319-26832-3_29
Published: 03 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26831-6
Online ISBN: 978-3-319-26832-3
eBook Packages: Computer Science, Computer Science (R0)