Data-Driven Approaches to Objective Evaluation of Phoneme Alignment Systems

  • Ladan Baghai-Ravary
  • Greg Kochanski
  • John Coleman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6562)


This paper presents techniques for objective characterisation of Automatic Speech-to-Phoneme Alignment (ASPA) systems, without the need for human-generated labels to act as a benchmark. As well as being immune to the effects of human variability, these techniques yield diagnostic information which can be helpful in the development of new alignment systems, ensuring that the resulting labels are as consistent as possible. To illustrate this, a total of 48 ASPA systems are evaluated, built on three different front-end processors. For each processor, the number of states in each phoneme model and the number of Gaussian distributions in each state mixture are varied to generate a broad variety of systems. The results are compared using a statistical measure and a model-based Bayesian Monte-Carlo approach. The most consistent alignment system is identified, and is (as expected) in close agreement with typical “baseline” systems used in ASR research.


Keywords: Phonetic alignment; label accuracy; phoneme detectivity; objective evaluation





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Ladan Baghai-Ravary (1)
  • Greg Kochanski (1)
  • John Coleman (1)
  1. Phonetics Laboratory, Oxford University, Oxford, UK
