Representation Learning for Underdefined Tasks
In the neural network landscape, the large majority of approaches and research effort is dedicated to well-defined tasks, such as recognizing a cat in an image or discriminating noise from speech recordings. For this kind of task, it is easy to write a labeling reference guide and thereby obtain training and evaluation data with a ground truth. But for a large set of high-level human tasks, and particularly for tasks related to the artistic field, the task itself is hard to define: only the result is known, and writing such a labeling guide is difficult or impossible. We refer to this kind of problem as an “underdefined task”. In this presentation, a methodology based on representation learning is proposed to tackle this class of problems, and a practical example is shown in the domain of voice casting for voice dubbing.
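As a rough illustration of the voice-casting setting (a sketch, not the authors' actual model), a Siamese similarity function passes two voice excerpts through one shared embedding network and scores the pair; a trained version of such a score could rank candidate dubbing voices against an original voice. The feature dimensions, the single linear layer, and the untrained random weights below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 40-dim acoustic feature vectors -> 16-dim latent space.
W = rng.normal(size=(16, 40))


def embed(x):
    """Shared branch of the Siamese network: one linear layer + tanh."""
    return np.tanh(W @ x)


def similarity(x_a, x_b):
    """Cosine similarity between the embeddings of two voice excerpts."""
    e_a, e_b = embed(x_a), embed(x_b)
    return float(e_a @ e_b / (np.linalg.norm(e_a) * np.linalg.norm(e_b)))


# Two hypothetical excerpts: an original voice and a candidate dubbed voice.
original = rng.normal(size=40)
candidate = rng.normal(size=40)
score = similarity(original, candidate)  # in [-1, 1]
```

Because both inputs go through the same weights, the score is symmetric in its arguments; in practice the shared network would be trained (e.g. with a contrastive objective) on pairs labeled by the known casting outcome rather than left random.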
Keywords: Representation learning · Underdefined task · Knowledge distillation · Transfer learning · Voice casting · Voice dubbing
Acknowledgment and Credits
The voice casting for voice dubbing work was supported by the Avignon University foundation “Pierre Berge” PhD program and by the ANR TheVoice project ANR-17-CE23-0025 (Digital Voice Design for the Creative Industry).
The main part of the presented work on voice casting was done by Adrien Gresse during his PhD. Some ongoing parts come directly from Mathias Quillot's PhD, which is still in progress. Both provided a large part of the figures and tables of this presentation.