Abstract
In recent years, automatic music-driven choreography has emerged as a highly challenging problem. In this paper, we propose a music-driven choreography system based on conditional generative adversarial networks. First, we build a dataset, MF-DS, that integrates MFCC features and Dancing Skeletons extracted from Japanese dancing videos. The MFCC features are extracted at music beats, and the dancing skeletons are detected from the image frames of each video. We then train the music-driven choreography system as a generative adversarial network: the generator integrates residual blocks into fractionally strided convolutions, and the discriminator uses conventional CNNs. Two indicators, beat loss and choreography diversity, are proposed to evaluate three learning models in the experiments. Finally, we validate that the three models, at their best epochs, achieve near-zero generator and discriminator losses, thereby generating stable skeletons while exhibiting choreography diversity.
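To illustrate how a beat-alignment indicator of this kind can be computed, the sketch below measures the mean time gap between music beats and the nearest motion peaks of a generated dance. The function name and the metric's exact form are illustrative assumptions; the paper's own beat loss definition may differ.

```python
import numpy as np

def beat_alignment_loss(beat_times, motion_peak_times):
    """Mean distance (in seconds) from each music beat to the nearest
    motion peak of the dance. Lower values mean movements land closer
    to the beats. NOTE: illustrative metric, not the paper's exact
    'beat loss' formula."""
    beats = np.asarray(beat_times, dtype=float)
    peaks = np.asarray(motion_peak_times, dtype=float)
    # Pairwise |beat - peak| distances: shape (num_beats, num_peaks)
    dists = np.abs(beats[:, None] - peaks[None, :])
    # For each beat, keep only its closest motion peak, then average
    return float(dists.min(axis=1).mean())

# Hypothetical example: beats every 0.5 s, motion peaks 50 ms late
beats = [0.5, 1.0, 1.5, 2.0]
peaks = [0.55, 1.05, 1.55, 2.05]
print(beat_alignment_loss(beats, peaks))  # ≈ 0.05
```

A dance perfectly synchronized with the music would drive this value toward zero, which matches the abstract's use of beat loss as a quality indicator.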
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig7_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig8_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig9_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig10_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig11_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig12_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig13_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig14_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig15_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig16_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig17_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig18_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig19_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig20_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig21_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig22_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig23_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig24_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00521-021-05752-x/MediaObjects/521_2021_5752_Fig25_HTML.png)
Data availability
The dataset MF-DS, which integrates MFCC features and Dancing Skeletons extracted from Japanese dancing videos, was built for this study.
Code availability
OpenPose is free software that enables real-time multi-person 2D pose estimation using part affinity fields.
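OpenPose writes one JSON file per video frame, with each detected person's keypoints stored as a flat array of (x, y, confidence) triplets under `pose_keypoints_2d` (25 keypoints for the BODY_25 model). A minimal sketch of turning that output into a skeleton, assuming a single dancer per frame:

```python
import json

def load_skeleton(json_text, num_keypoints=25):
    """Parse one OpenPose per-frame JSON output into a list of
    (x, y, confidence) triplets for the first detected person.
    Returns None when no person was detected in the frame."""
    data = json.loads(json_text)
    people = data.get("people", [])
    if not people:
        return None
    flat = people[0]["pose_keypoints_2d"]  # [x0, y0, c0, x1, y1, c1, ...]
    return [tuple(flat[i:i + 3]) for i in range(0, 3 * num_keypoints, 3)]

# Minimal two-keypoint frame with hypothetical coordinates
frame = '{"people": [{"pose_keypoints_2d": [10.0, 20.0, 0.9, 30.0, 40.0, 0.8]}]}'
print(load_skeleton(frame, num_keypoints=2))
# [(10.0, 20.0, 0.9), (30.0, 40.0, 0.8)]
```

Dropping low-confidence triplets and stacking the per-frame skeletons over time yields the kind of motion sequence the MF-DS dataset pairs with beat-aligned MFCC features.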
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). TensorFlow: a system for large-scale machine learning, In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, 265–283.
Alemi, O., Françoise, J., & Pasquier, P. (2017). GrooveNet: real-time music-driven dance movement generation using artificial neural networks, In Proceedings of the 23rd SIGKDD Workshop on Machine Learning for Creativity.
Alemi, O., & Pasquier, P. (2019). Machine learning for data-driven movement generation: a review of the state of the art, arXiv:1903.08356v1.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN, arXiv:1701.07875v3.
Cao, Z., Hidalgo, G., Simon, T., Wei, S., & Sheikh, Y. (2019). OpenPose: realtime multi-person 2D pose estimation using part affinity fields, arXiv:1812.08008v2.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv:1406.1078v3.
Crnkovic-Friis, L., & Crnkovic-Friis, L. (2016). Generative choreography using deep learning, arXiv:1605.06921v1.
Dumoulin, V., & Visin, F. (2018). A guide to convolution arithmetic for deep learning, arXiv:1603.07285v2.
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. Adv Neural Inform Processing Syst 11:2672–2680
Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville A (2017) Improved training of Wasserstein GANs. Adv Neural Inform Processing Syst 12:5769–5779
He, K., Zhang, X., Ren, S.,& Sun, J. (2016). Deep residual learning for image recognition, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift, arXiv:1502.03167v3.
Kingma, D. P., & Ba, J. (2017). Adam: A method for stochastic optimization, arXiv:1412.6980v9.
LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1(4):541–551
Lee, J., Kim, S., & Lee, K. (2018). Listen to dance: music-driven choreography generation using autoregressive encoder-decoder network, arXiv:1811.00818v1.
Lee HY, Yang X, Liu MY, Wang TC, Lu YD, Yang MH, Kautz J (2019) Dancing to music. Adv Neural Inform Processing Syst 13:3581–3591
Levina, E., & Bickel, P. (2001). The earth mover's distance is the mallows distance: some insights from statistics, In Proceedings of the 8th IEEE International Conference on Computer Vision, 251–256
Liu L, Zhang H, Xu X, Zhang Z, Yan S (2020) Collocating clothes with generative adversarial networks cosupervised by categories and attributes: a multidiscriminator framework. IEEE Trans Neural Netw Learn Syst 31(9):3540–3554
Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models, In Proceedings of the 30th International Conference on Machine Learning.
Martinez, J., Hossain, R., Romero, J., & Little, J. J. (2017). A simple yet effective baseline for 3D human pose estimation, In Proceedings of the IEEE International Conference on Computer Vision, 2640–2649
Mehta D, Sridhar S, Sotnychenko O, Rhodin H, Shafiei M, Seidel HP, Xu W, Casas D, Theobalt C (2017) VNect: real-time 3D human pose estimation with a single RGB camera. ACM Trans Graph 36(4):1–14
Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets, arXiv:1411.1784v1.
Ofli F, Demir Y, Yemez Y, Erzin E, Tekalp AM, Balcı K et al (2008) An audio-driven dancing avatar. J Multimodal User Interfaces 2:93–103
Ofli F, Erzin E, Yemez Y, Tekalp AM (2012) Learn2Dance: learning statistical music-to-dance mappings for choreography synthesis. IEEE Trans Multimedia 14(3):747–759
Oore, S., & Akiyama, Y. (2006). Learning to synthesize arm motion to music by example, In Proceedings of the 14th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, 201–208
Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C. & Murphy, K. (2017). Towards accurate multi-person pose estimation in the wild, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4903–4911
Pavllo, D., Feichtenhofer, C., Grangier, D. & Auli, M. (2019). 3D human pose estimation in video with temporal convolutions and semi-supervised training, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7753–7762
Simonyan, K. & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556v6.
Srivastava, R. K., Greff, K. & Schmidhuber, J. (2015). Highway networks, arXiv:1505.00387v2.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. & Rabinovich, A. (2015). Going deeper with convolutions, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9
Tang, T., Jia, J., & Mao, H. (2018). Dance with melody: an LSTM-autoencoder approach to music-oriented dance synthesis, In Proceedings of the 26th ACM International Conference on Multimedia, 1598–1606.
Taylor, G. W., & Hinton, G. E. (2009). Factored conditional restricted Boltzmann machines for modeling motion style, In Proceedings of the 26th Annual International Conference on Machine Learning, 1025–1032.
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D et al (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17:261–272
Zhang H, Sun Y, Liu L et al (2020) ClothingOut: a category-supervised GAN model for clothing segmentation and retrieval. Neural Comput Appl 32:4519–4530
Kim, T., et al. (2018). carpedm20/DCGAN-tensorflow. Retrieved from https://github.com/carpedm20/DCGAN-tensorflow/blob/master/ops.py
McDonald, K. (2018). Dance x machine learning: first steps. Retrieved from https://medium.com/@kcimc/discrete-figures-7d9e9c275c47
Ashibuto Penta (2013). Natsukoi Hanabi (Summer Love Fireworks), dance cover [Kumori nochi Hare]. Retrieved from https://youtu.be/E_JrGQdX5vU
Miko (2016). Hoshikuzu Orchestra (Stardust Orchestra), dance cover [Singapore]. Retrieved from https://youtu.be/rjlAxvfmRtE
Riria (2017). Kawaiku Naritai (I Want to Be Cute), dance cover [with three costume changes]. Retrieved from https://youtu.be/r5rePp_2LHk
Acknowledgements
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary file 1
About this article
Cite this article
Huang, YF., Liu, WD. Choreography cGAN: generating dances with music beats using conditional generative adversarial networks. Neural Comput & Applic 33, 9817–9833 (2021). https://doi.org/10.1007/s00521-021-05752-x