Abstract
We explore the potential of word2vec, a popular distributional semantics vector space model, for capturing meaningful relationships in ecological (complex polyphonic) music. More precisely, we use the skip-gram version of word2vec to model slices of music from a large corpus spanning eight musical genres. In the learned vector space, a metric based on cosine distance distinguishes between functional chord relationships as well as harmonic associations in the music. Evidence based on the cosine distance between chord-pair vectors suggests that an implicit circle of fifths exists in the vector space. In addition, a comparison between pieces in different keys reveals that key relationships are represented in word2vec space. These results suggest that the learned embedding captures tonal and harmonic characteristics of music without receiving explicit information about the musical content of the constituent slices. To investigate whether proximity in the embedding space is indicative of ‘semantically related’ slices, we explore a music generation task in which existing slices of a given piece are automatically replaced with new slices. We propose an algorithm that finds substitute slices based on spatial proximity and the pitch class distribution inferred in the chosen subspace. The results indicate that the size of the subspace used has a significant effect on whether slices belonging to the same key are selected. In sum, the proposed word2vec model learns music-vector embeddings that capture meaningful tonal and harmonic relationships in music, and thereby provides a useful tool for exploring musical properties and comparisons across pieces, a potential input representation for deep learning models, and a music generation device.
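The two operations the abstract relies on, measuring cosine distance between slice vectors and picking a substitute slice by proximity, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the slice labels, and the three-dimensional vectors are hypothetical and invented for illustration, and the sketch leaves out the pitch class distribution step that the proposed algorithm also uses.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); 0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def nearest_slices(query, embeddings, k=1):
    """Return the k slice labels closest to `query`, excluding `query` itself."""
    q = embeddings[query]
    ranked = sorted(
        (cosine_distance(q, vec), label)
        for label, vec in embeddings.items()
        if label != query
    )
    return [label for _, label in ranked[:k]]


# Hypothetical toy embeddings (assumption): real slice vectors would come
# from a trained skip-gram model and have many more dimensions.
toy_embeddings = {
    "C-E-G": [1.0, 0.0, 0.0],
    "G-B-D": [0.8, 0.6, 0.0],
    "F#-A#-C#": [0.0, 0.0, 1.0],
}

substitutes = nearest_slices("C-E-G", toy_embeddings, k=1)
```

With these made-up vectors, the slice labelled "G-B-D" lies closer to "C-E-G" than "F#-A#-C#" does, so it is the one returned as a substitute candidate. The outcome reflects only how the toy vectors were chosen and says nothing about the trained model.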
Acknowledgements
This research was partly supported through SUTD Grant No. SRG ISTD 2017 129.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix
See Fig. 10.
Cite this article
Chuan, CH., Agres, K. & Herremans, D. From context to concept: exploring semantic relationships in music with word2vec. Neural Comput & Applic 32, 1023–1036 (2020). https://doi.org/10.1007/s00521-018-3923-1
DOI: https://doi.org/10.1007/s00521-018-3923-1