Cross-Language Acoustic Modeling for Macedonian Speech Technology Applications

  • Ivan Kraljevski
  • Guntram Strecha
  • Matthias Wolff
  • Oliver Jokisch
  • Slavcho Chungurski
  • Rüdiger Hoffmann
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 207)


This paper presents a cross-language development method for speech recognition and synthesis applications for the Macedonian language. A unified system for speech recognition and synthesis, trained on German language data, was used for acoustic model bootstrapping and adaptation. Both knowledge-based and data-driven approaches to mapping phonemes between the source and target languages were used for the initial transcription and labeling of a small amount of recorded speech. Recognition experiments that applied the source language acoustic model to the target language dataset showed significant degradation in recognition performance. Acceptable performance was achieved after maximum a posteriori (MAP) model adaptation with a limited amount of target language data, making the adapted model suitable for small to medium vocabulary speech recognition applications. The same unified system was then used to train a new, separate acoustic model for HMM-based synthesis. Qualitative analysis showed that, despite the low quality of the available recordings and the sub-optimal phoneme mapping, HMM synthesis produces perceptually good and intelligible synthetic speech.
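
As a rough illustration of the MAP adaptation step mentioned in the abstract, the sketch below applies the standard MAP re-estimation formula for a single Gaussian mean: the adapted mean is a weighted average of the prior (source language) mean and the posterior-weighted target language frames. It is only a minimal sketch, not the authors' implementation. The function name `map_adapt_mean`, the prior weight `tau`, and the list-based data layout are assumptions made for illustration, and the abstract does not state which model parameters were adapted.

```python
def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP update of one Gaussian mean vector (illustrative sketch).

    prior_mean -- mean taken from the source-language model
    frames     -- target-language feature vectors (list of lists)
    posteriors -- occupation probability of this Gaussian for each frame
    tau        -- prior weight (hypothetical value); larger values keep
                  the adapted mean closer to the prior mean
    """
    if len(frames) != len(posteriors):
        raise ValueError("need exactly one posterior per frame")
    if tau <= 0:
        raise ValueError("tau must be positive")

    dim = len(prior_mean)
    occupancy = sum(posteriors)  # soft count of frames assigned to this Gaussian

    # Posterior-weighted sum of the adaptation frames, per dimension.
    weighted_sum = [0.0] * dim
    for frame, gamma in zip(frames, posteriors):
        for d in range(dim):
            weighted_sum[d] += gamma * frame[d]

    # Interpolate between the prior mean and the adaptation data.
    return [
        (tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
        for d in range(dim)
    ]


if __name__ == "__main__":
    adapted = map_adapt_mean([0.0, 0.0], [[2.0, 4.0], [2.0, 4.0]], [1.0, 1.0], tau=2.0)
    print(adapted)
```

With only a few adaptation frames the result stays close to the source language mean, and it moves toward the target language data as the occupancy grows. This is why this kind of update is often considered when only a limited amount of target language data is available.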


Keywords: speech recognition, speech synthesis, cross-language bootstrapping





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Ivan Kraljevski (1, corresponding author)
  • Guntram Strecha (1)
  • Matthias Wolff (2)
  • Oliver Jokisch (1)
  • Slavcho Chungurski (3)
  • Rüdiger Hoffmann (1)

  1. Chair for System Theory and Speech Technology, TU Dresden, Dresden, Germany
  2. Electronics and Information Technology Institute, BTU Cottbus, Cottbus, Germany
  3. Faculty of Informatics, FON University, Skopje, Macedonia
