Efficient data selection for ASR

  • Original Paper
Language Resources and Evaluation

Abstract

Automatic speech recognition (ASR) technology has matured over the past few decades and has had a significant impact in a variety of fields, from assistive technologies to commercial products. However, ASR system development is resource-intensive: it requires language resources in the form of text-annotated audio recordings and pronunciation dictionaries. Many languages of the developing world are resource-scarce, and this scarcity severely inhibits the deployment of ASR systems there. One way to assist resource-scarce ASR system development is to select “useful” training samples, which could reduce the resources needed to collect new corpora. In this work, we propose a new data selection framework that can be used to design a speech recognition corpus. We show that, for limited data sets and independent of language and bandwidth, the most effective data selection strategy is frequency-matched selection, and that the widely used maximum entropy methods generally produce the least promising results. In our model, the frequency-matched selection method corresponds to a logarithmic relationship between accuracy and corpus size; we also investigated other model relationships and found that a hyperbolic relationship (as suggested by simple asymptotic arguments in learning theory) may lead to somewhat better performance under certain conditions.
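
The link between a logarithmic accuracy model and frequency-matched selection can be illustrated with a short worked derivation. This is only a sketch under stated assumptions and not necessarily the formulation used in the paper: it assumes that overall accuracy is a frequency-weighted sum of per-unit accuracies, that unit i occurs with relative frequency f_i in the target domain (with the f_i summing to 1), and that the corpus contains n_i training samples of unit i out of a fixed budget N. The symbols f_i, n_i, N, a_\infty and c are introduced here for illustration only.

```latex
% Assumed objective: frequency-weighted accuracy under a fixed sample budget.
\max_{n_1,\dots,n_K} \; \sum_{i=1}^{K} f_i \, a(n_i)
\quad \text{subject to} \quad \sum_{i=1}^{K} n_i = N .

% Logarithmic model (assumed form): a(n_i) = \log n_i .
% Stationarity of the Lagrangian gives f_i / n_i = \lambda for every i, so
n_i = \frac{f_i}{\lambda}, \qquad
\sum_{i} n_i = N \;\Rightarrow\; \lambda = \frac{1}{N}, \qquad
n_i = f_i \, N .
% The selected corpus matches the target frequencies.

% Hyperbolic model (assumed form): a(n_i) = a_\infty - c / n_i , with c > 0 .
% Stationarity gives f_i \, c / n_i^2 = \lambda for every i, so
n_i = N \, \frac{\sqrt{f_i}}{\sum_{j=1}^{K} \sqrt{f_j}} .
% Rare units receive relatively more samples than under frequency matching.
```

Under these assumptions, the logarithmic model leads to sample counts proportional to the target frequencies, whereas the hyperbolic model leads to counts proportional to their square roots, which is a flatter allocation.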

Author information

Corresponding author

Correspondence to Neil Taylor Kleynhans.

About this article

Cite this article

Kleynhans, N.T., Barnard, E. Efficient data selection for ASR. Lang Resources & Evaluation 49, 327–353 (2015). https://doi.org/10.1007/s10579-014-9285-0

  • DOI: https://doi.org/10.1007/s10579-014-9285-0
