Language Resources and Evaluation, Volume 50, Issue 2, pp 221–243

KALAKA-3: a database for the assessment of spoken language recognition technology on YouTube audios

  • Luis Javier Rodríguez-Fuentes
  • Mikel Penagarikano
  • Amparo Varona
  • Mireia Diez
  • Germán Bordel
Original Paper

Abstract

KALAKA-3 is a speech database specifically designed for the development and evaluation of Spoken Language Recognition (SLR) systems. The database provides TV broadcast speech for training, and audio data extracted from YouTube videos for tuning and testing. The database was created to support the Albayzin 2012 Language Recognition Evaluation (LRE), which featured two language recognition tasks, both dealing with European languages. The first one involved six target languages (Basque, Catalan, English, Galician, Portuguese and Spanish) for which there was plenty of training data, whereas the second one involved four target languages (French, German, Greek and Italian) for which no training data was provided. This second task tried to simulate the use case of low resource languages. Two separate sets of YouTube audio files were provided to test the performance of language recognition systems on both tasks. To allow open-set tests, these datasets included speech in 11 additional (Out-Of-Set) European languages. In this paper, we first discuss the design issues considered when creating the database and describe the data collection procedure. Then, we present the results attained in the Albayzin 2012 LRE, along with the performance of state-of-the-art systems on the four evaluation tracks defined on the database. Both series of results demonstrate the usefulness of KALAKA-3 as a challenging benchmark for the advancement of SLR technology. As far as we know, this is the first database specifically designed to benchmark SLR technology on YouTube audios.

Keywords

Spoken language recognition; YouTube audio; Broadcast speech; European languages; Low-resource languages

Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  1. Grupo de Trabajo en Tecnologías Software (GTTS), Departamento de Electricidad y Electrónica, ZTF-FCT, University of the Basque Country UPV/EHU, Leioa, Spain
