Collecting and evaluating speech recognition corpora for 11 South African languages

  • Original paper
  • Language Resources and Evaluation

Abstract

We describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus that contains data from the eleven official languages of South Africa. Because of practical constraints, the amount of speech per language is relatively small compared to major corpora in world languages, and we report on our investigation of the stability of the ASR models derived from the corpus. We also report on phoneme distance measures across languages, and describe initial phone recognisers that were developed using this data. We find that a surprisingly small number of speakers (fewer than 50) and around 10 to 20 hours of speech per language are sufficient for acceptable phone-based recognition.

Author information

Corresponding author

Correspondence to Jaco Badenhorst.

About this article

Cite this article

Badenhorst, J., van Heerden, C., Davel, M. et al. Collecting and evaluating speech recognition corpora for 11 South African languages. Lang Resources & Evaluation 45, 289–309 (2011). https://doi.org/10.1007/s10579-011-9152-1

  • DOI: https://doi.org/10.1007/s10579-011-9152-1
