KALAKA-3: a database for the assessment of spoken language recognition technology on YouTube audios
KALAKA-3 is a speech database specifically designed for the development and evaluation of Spoken Language Recognition (SLR) systems. The database provides TV broadcast speech for training, and audio data extracted from YouTube videos for tuning and testing. The database was created to support the Albayzin 2012 Language Recognition Evaluation (LRE), which featured two language recognition tasks, both dealing with European languages. The first task involved six target languages (Basque, Catalan, English, Galician, Portuguese and Spanish) for which ample training data was available, whereas the second involved four target languages (French, German, Greek and Italian) for which no training data was provided. The second task was intended to simulate the use case of low-resource languages. Two separate sets of YouTube audio files were provided to test the performance of language recognition systems on both tasks. To allow open-set tests, these datasets included speech in 11 additional (Out-Of-Set) European languages. In this paper, we first discuss the design issues considered when creating the database and describe the data collection procedure. Then, we present the results attained in the Albayzin 2012 LRE, along with the performance of state-of-the-art systems on the four evaluation tracks defined on the database. Both series of results demonstrate the usefulness of KALAKA-3 as a challenging benchmark for the advancement of SLR technology. As far as we know, this is the first database specifically designed to benchmark SLR technology on YouTube audio.
Keywords: Spoken language recognition, YouTube audio, Broadcast speech, European languages, Low-resource languages