Abstract
This paper presents an approach for discovering word units in an unknown language under zero-resource conditions. The method relies only on acoustic similarity: cross-lingual phoneme recognition is followed by the identification of consistent phoneme strings. To this end, a two-phase algorithm is proposed. The first phase is an acoustic-phonetic decoding process that uses a universal set of phonemes unrelated to the target language. Its goal is to reduce the search space of similar speech segments, avoiding the quadratic search that arises when all speech files are compared with one another. In the second phase, the segments found are further refined by means of different approaches based on Dynamic Time Warping. To include more hypotheses than only those that match perfectly in terms of phonemes, an edit distance is computed so that hypotheses under a given threshold are also incorporated. Three frame representations are studied: raw acoustic features, autoencoders and phoneme posteriorgrams. The approach has been evaluated on the corpus used in the Zero Resource Speech Challenge 2017.
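The two ideas named in the abstract, admitting phoneme-string matches under an edit-distance threshold and then refining candidates with Dynamic Time Warping, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`edit_distance`, `candidate_pairs`, `dtw_distance`), the unit edit costs, the Euclidean frame distance, the length normalisation and the toy all-pairs loop over short phoneme strings are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences (unit costs assumed)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def candidate_pairs(segments: List[Sequence[str]],
                    max_edits: int) -> List[Tuple[int, int]]:
    """Index pairs whose phoneme strings differ by at most max_edits edits.

    Hypothetical helper: a plain all-pairs loop over symbolic strings,
    kept simple for illustration only.
    """
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if edit_distance(segments[i], segments[j]) <= max_edits:
                pairs.append((i, j))
    return pairs


def dtw_distance(x: Sequence[Sequence[float]],
                 y: Sequence[Sequence[float]]) -> float:
    """Length-normalised DTW cost between two sequences of frame vectors.

    Assumes Euclidean distance between frames; the frames could be raw
    acoustic features, autoencoder codes or phoneme posteriorgrams.
    """
    n, m = len(x), len(y)
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(x[i - 1], y[j - 1])
            acc[i][j] = d + min(acc[i - 1][j],      # advance in x only
                                acc[i][j - 1],      # advance in y only
                                acc[i - 1][j - 1])  # advance in both
    return acc[n][m] / (n + m)


if __name__ == "__main__":
    # Toy decoded phoneme strings (invented for the example).
    decoded = [["k", "a", "s", "a"], ["k", "a", "z", "a"], ["p", "e", "r", "o"]]
    print(candidate_pairs(decoded, max_edits=1))
    # Toy one-dimensional frame sequences (invented for the example).
    print(dtw_distance([(0.0,), (0.0,), (1.0,)], [(0.0,), (1.0,)]))
```

In a pipeline of this kind, only the pairs returned by the symbolic filter would be passed to the more expensive frame-level DTW comparison, which is what keeps the acoustic search from covering every pair of speech segments.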
Acknowledgments
This work was funded by the Spanish MINECO and FEDER funds under contract TIN2017-85854-C4-2-R. The work of José-Ángel González is also financed by Universitat Politècnica de València under grant PAID-01-17.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
García-Granada, F., Sanchis, E., Castro-Bleda, M.J., González, J.Á., Hurtado, LF. (2019). Word Discovering in Low-Resources Languages Through Cross-Lingual Phonemes. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_14
DOI: https://doi.org/10.1007/978-3-030-26061-3_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26060-6
Online ISBN: 978-3-030-26061-3
eBook Packages: Computer Science (R0)