From context to concept: exploring semantic relationships in music with word2vec

  • Ching-Hua Chuan
  • Kat Agres
  • Dorien Herremans
Deep learning for music and audio


We explore the potential of a popular distributional semantics vector space model, word2vec, for capturing meaningful relationships in ecological (complex polyphonic) music. More precisely, the skip-gram version of word2vec is used to model slices of music from a large corpus spanning eight musical genres. In this newly learned vector space, a metric based on cosine distance is able to distinguish between functional chord relationships, as well as harmonic associations in the music. Evidence, based on cosine distance between chord-pair vectors, suggests that an implicit circle-of-fifths exists in the vector space. In addition, a comparison between pieces in different keys reveals that key relationships are represented in word2vec space. These results suggest that the newly learned embedded vector representation does in fact capture tonal and harmonic characteristics of music, without receiving explicit information about the musical content of the constituent slices. In order to investigate whether proximity in the discovered space of embeddings is indicative of ‘semantically-related’ slices, we explore a music generation task, by automatically replacing existing slices from a given piece of music with new slices. We propose an algorithm to find substitute slices based on spatial proximity and the pitch class distribution inferred in the chosen subspace. The results indicate that the size of the subspace used has a significant effect on whether slices belonging to the same key are selected. In sum, the proposed word2vec model is able to learn music-vector embeddings that capture meaningful tonal and harmonic relationships in music, thereby providing a useful tool for exploring musical properties and comparisons across pieces, as a potential input representation for deep learning models, and as a music generation device.
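The proximity measure described above can be illustrated with a minimal sketch. It assumes slice embeddings are already available as plain vectors; the labels and numbers in `toy_embeddings` are invented for illustration and are not taken from the trained model. The helper names (`cosine_distance`, `nearest_slices`) are hypothetical, and the sketch covers only the spatial-proximity step, not the pitch class distribution used in the proposed substitution algorithm.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance: 0 for identical directions, up to 2 for opposite ones."""
    return 1.0 - cosine_similarity(u, v)

def nearest_slices(target, embeddings, k=2):
    """Return the k labels whose vectors are closest to the target's vector."""
    others = [name for name in embeddings if name != target]
    others.sort(key=lambda name: cosine_distance(embeddings[target], embeddings[name]))
    return others[:k]

# Invented 3-dimensional vectors; real embeddings would come from a trained
# skip-gram model and have many more dimensions.
toy_embeddings = {
    "C_major": [1.0, 0.1, 0.0],
    "G_major": [0.9, 0.3, 0.1],
    "F_major": [0.8, 0.0, 0.3],
    "F#_major": [-0.7, 0.6, 0.2],
}

candidates = nearest_slices("C_major", toy_embeddings, k=2)
print(candidates)
```

In this toy space, the two closest candidates to the C major vector are the vectors labelled G major and F major, mirroring the kind of circle-of-fifths neighbourhood the abstract reports, although here that outcome is built into the invented numbers.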


Word2vec, Music, Semantic vector space model



This research was partly supported through SUTD Grant No. SRG ISTD 2017 129.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  1. Department of Cinema and Interactive Media, School of Communication, University of Miami, Coral Gables, USA
  2. Social and Cognitive Computing Department, Institute for High Performance Computing, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  3. Information Systems, Technology, and Design Pillar, Singapore University of Technology and Design, Singapore, Singapore
