Fusion of Text and Audio Semantic Representations Through CCA

  • Kamelia Aryafar
  • Ali Shokoufandeh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8869)


Humans are natural multimedia processing machines. Multimedia is a domain of multi-modalities including audio, text and images. A central aspect of multimedia processing is the coherent integration of media from different modalities as a single identity. Multimodal information fusion architectures become a necessity when not all information channels are available at all times. In this paper, we introduce a multimodal fusion of audio signals and lyrics in a shared semantic space through canonical correlation analysis. We propose an audio retrieval system based on extended semantic analysis of audio signals. We will combine this model with a tf-idf representation of lyrics to achieve a multimodal retrieval system. We use canonical correlation analysis and supervised learning methods as a basis for relating audio and lyrics information. Our experimental evaluation of the proposed method indicated that the proposed model outperforms the prior approaches based on simple canonical correlation methods. Finally, the efficiency of the proposed method allows for dealing with large music and lyrics collections enabling users to explore relevant lyrics information for music datasets.


Content-based music information retrieval Cross-domain retrievals Multimodal information fusion 


  1. 1.
    Aryafar, K., Jafarpour, S., Shokoufandeh, A.: Music genre classification using sparsity-eager support vector machines. Technical reportGoogle Scholar
  2. 2.
    Aryafar, K., Jafarpour, S., Shokoufandeh, A.: Automatic musical genre classification using sparsity-eager support vector machines. In: 2012 21st International Conference on Pattern Recognition (ICPR), pp. 1526–1529. IEEE (2012)Google Scholar
  3. 3.
    Aryafar, K., Shokoufandeh, A.: Music genre classification using explicit semantic analysis. In: Proceedings of the 1st International ACM Workshop on Music Information Retrieval with User-Centered and Multimodal Strategies, pp. 33–38. ACM (2011)Google Scholar
  4. 4.
    Pradeep, K., Atrey, M., Hossain, A., Saddik, A.E., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: a survey. Multimed. Syst. 16(6), 345–379 (2010)CrossRefGoogle Scholar
  5. 5.
    Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011) (2011)Google Scholar
  6. 6.
    Dorai, C., Venkatesh, S.: Bridging the semantic gap in content management systems. In: Dorai, C., Venkatesh, S. (eds.) Media Computing, pp. 1–9. Springer, New York (2002)CrossRefGoogle Scholar
  7. 7.
    Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit semantic analysis. IJCAI 7, 1606–1611 (2007)Google Scholar
  8. 8.
    Jensen, B.S., Troelsgaard, R., Larsen, J., Hansen, L.K.: Towards a universal representation for audio information retrieval and analysis. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3168–3172. IEEE (2013)Google Scholar
  9. 9.
    Kim, Y.E., Schmidt, E.M., Migneco, R., Morton, B.G., Richardson, P., Scott, J., Speck, J.A., Turnbull, D.: Music emotion recognition: a state of the art review. In: Proceedings of ISMIR, pp. 255–266. Citeseer (2010)Google Scholar
  10. 10.
    Li, T.L.H., Chan, A.B.: Genre classification and the invariance of MFCC features to key and tempo. In: Lee, K.-T., Tsai, W.-H., Liao, H.-Y.M., Chen, T., Hsieh, J.-W., Tseng, C.-C. (eds.) MMM 2011 Part I. LNCS, vol. 6523, pp. 317–327. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  11. 11.
    Mandel, M.I., Ellis, D.P.W.: Song-level features and support vector machines for music classification. In: Reiss, J.D., Wiggins, G.A. (eds.) Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR), pp. 594–599, September 2005Google Scholar
  12. 12.
    McVicar, M., De Bie, T.: CCA and a multi-way extension for investigating common components between audio, lyrics and tags. In: Proceedings of the 9th International Symposium on Computational Music Modeling and Retrieval (CMMR), pp. 53–68 (2012)Google Scholar
  13. 13.
    Schüssel, F., Honold, F., Weber, M.: MPRSS 2012. LNCS, vol. 7742. Springer, Heidelberg (2013)CrossRefGoogle Scholar
  14. 14.
    Typke, R., Wiering, F., Veltkamp, R.C.: A survey of music information retrieval systems. In: ISMIR, pp. 153–160 (2005)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. 1.Computer Science DepartmentDrexel UniversityPhiladelphiaUSA

Personalised recommendations