Skip to main content

SVM-Based Detection of Misannotated Words in Read Speech Corpora

  • Conference paper
Text, Speech, and Dialogue (TSD 2013)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 8082))

Included in the following conference series:


Automatic detection of misannotated words in single-speaker read-speech corpora is investigated in this paper. Support vector machine (SVM) classifier was proposed to detect the misannotated words. Its performance was evaluated with respect to various word-level feature sets. The SVM classifier was shown to perform very well with both high precision and recall scores and with F1 measure being almost 88%. This is a statistically significant improvement over a traditionally used outlier-based detection method.

The work has been supported by the Technology Agency of the Czech Republic, project No. TA01030476, and by the European Regional Development Fund (ERDF), project “New Technologies for Information Society” (NTIS), European Centre of Excellence, ED1.1.00/02.0090. The access to the MetaCentrum clusters provided under the programme LM2010005 is highly appreciated.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or Ebook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Similar content being viewed by others


  1. Matoušek, J., Tihelka, D., Šmídl, L.: On the impact of annotation errors on unit-selection speech synthesis. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS, vol. 7499, pp. 456–463. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  2. Matoušek, J., Romportl, J.: Recording and Annotation of Speech Corpus for Czech Unit Selection Speech Synthesis. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 326–333. Springer, Heidelberg (2007)

    Chapter  Google Scholar 

  3. Adell, J., Agüero, P.D., Bonafonte, A.: Database pruning for unsupervised building of text-to-speech voices. In: Proc. ICASSP, Toulouse, France, pp. 889–892 (2006)

    Google Scholar 

  4. Tachibana, R., Nagano, T., Kurata, G., Nishimura, M., Babaguchi, N.: Preliminary experiments toward automatic generation of new TTS voices from recorded speech alone. In: Proc. INTERSPEECH, Antwerp, Belgium, pp. 1917–1920 (2007)

    Google Scholar 

  5. Wei, S., Hu, G., Hu, Y., Wang, R.H.: A new method for mispronunciation detection using support vector machine based on pronunciation space models. Speech Commun. 51(10), 896–905 (2009)

    Article  Google Scholar 

  6. Kominek, J., Black, A.: Impact of durational outlier removal from unit selection catalogs. In: Proc. SSW, Pittsburgh, USA, pp. 155–160 (2004)

    Google Scholar 

  7. Lu, H., Wei, S., Dai, L., Wang, R.H.: Automatic error detection for unit selection speech synthesis using log likelihood ratio based SVM classifier. In: Proc. INTERSPEECH, Makuhari, Japan, pp. 162–165 (2010)

    Google Scholar 

  8. Wang, W.Y., Georgila, K.: Automatic detection of unnatural word-level segments in unit-selection speech synthesis. In: Proc. ASRU, Hawaii, USA, pp. 289–294 (2011)

    Google Scholar 

  9. Tihelka, D., Kala, J., Matoušek, J.: Enhancements of Viterbi search for fast unit selection synthesis. In: Proc. INTERSPEECH, Makuhari, Japan, pp. 174–177 (2010)

    Google Scholar 

  10. Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P.: HTK Book (for HTK Version 3.4). The Cambridge University, Cambridge (2006)

    Google Scholar 

  11. Matoušek, J., Tihelka, D., Psutka, J.V.: Experiments with Automatic Segmentation for Czech Speech Synthesis. In: Matoušek, V., Mautner, P. (eds.) TSD 2003. LNCS (LNAI), vol. 2807, pp. 287–294. Springer, Heidelberg (2003)

    Chapter  Google Scholar 

  12. Matoušek, J., Romportl, J.: Automatic pitch-synchronous phonetic segmentation. In: Proc. INTERSPEECH, Brisbane, Australia, pp. 1626–1629 (2008)

    Google Scholar 

  13. Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10, 1895–1923 (1998)

    Article  Google Scholar 

  14. Cortes, C., Vapnik, V.: Support-vector networks. Machine Leaming 20(3), 273–279 (1995)

    MATH  Google Scholar 

  15. Matoušek, J., Tihelka, D.: Annotation errors detection in TTS corpora. In: Proc. Interspeech, Lyon, France (2013)

    Google Scholar 

  16. Romportl, J., Kala, J.: Prosody modelling in Czech text-to-speech synthesis. In: Proc. SSW, Bonn, Germany, pp. 200–205 (2007)

    Google Scholar 

  17. Taylor, P., Caley, R., Black, A., King, S.: Edinburgh speech tools library: System documentation (1999),

  18. Pedregosa, F., Varoquaux, G., Gramfort, A., Thirion, V.M.B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perror, M.: Édouard Duchesnay: Scikit-learn: Machine learning in Python. J. Machine Learn. Res. 12, 2825–2830 (2011)

    Google Scholar 

  19. Přibil, J., Přibilová, A.: Comparison of spectral and prosodic parameters of male and female emotional speech in Czech and Slovak. In: Proc. ICASSP, Prague, Czech Republic, pp. 4720–4723 (2011)

    Google Scholar 

  20. Ircing, P., Psutka, J., Psutka, J.V.: Using morphological information for robust language modeling in Czech ASR system. IEEE Trans. Audio Speech Lang. Process. 17, 840–847 (2009)

    Article  Google Scholar 

  21. Psutka, J., Švec, J., Psutka, J.V., Vaněk, J., Pražák, A., Šmídl, L., Ircing, P.: System for fast lexical and phonetic spoken term detection in a Czech cultural heritage archive. EURASIP J. Audio Speech Music Process. 10 (2011)

    Google Scholar 

Download references

Author information

Authors and Affiliations


Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Matoušek, J., Tihelka, D. (2013). SVM-Based Detection of Misannotated Words in Read Speech Corpora. In: Habernal, I., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2013. Lecture Notes in Computer Science(), vol 8082. Springer, Berlin, Heidelberg.

Download citation

  • DOI:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40584-6

  • Online ISBN: 978-3-642-40585-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics