
SATIN: a persistent musical database for music information retrieval and a supporting deep learning experiment on song instrumental classification

  • Yann Bayle
  • Matthias Robine
  • Pierre Hanna

Abstract

This paper introduces SATIN, the Set of Audio Tags and Identifiers Normalized. SATIN is a database of 400k audio-related metadata entries and identifiers that aims to facilitate reproducibility and comparison among Music Information Retrieval (MIR) algorithms. The idea is to take advantage of partnerships between scientists and private companies that host millions of tracks. Scientists can send their feature extraction algorithm to a company along with SATIN identifiers and retrieve the corresponding features. This procedure gives the MIR community access to more tracks for classification purposes. Afterwards, scientists can provide the MIR community with the classification result for each track, which can then be compared with the results of other algorithms. SATIN thus addresses the major problems of accessing more tracks, managing copyright locks, saving computation time, and guaranteeing consistency across research databases. We introduce SOFT1, the first Set Of FeaTures extracted by a company thanks to SATIN. We detail a possible use of SATIN with a supporting experiment that classifies tracks into instrumentals and songs, comparing a deep learning approach, which has emerged in recent years in MIR, with a knowledge-based approach.
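To make the exchange concrete, the sketch below mocks both sides of the procedure in Python. It is a minimal illustration under stated assumptions only: the function names, the in-memory catalog, and the ISRC strings are hypothetical stand-ins, not part of SATIN or of any company's actual API.

    # Hypothetical sketch of the SATIN exchange: a scientist ships a
    # feature-extraction function plus SATIN track identifiers (here, ISRCs)
    # to a partner company; the copyrighted audio never leaves the company's
    # servers, and only the resulting features are sent back.
    from typing import Callable, Dict, List

    import numpy as np


    def extract_features(audio: np.ndarray, sample_rate: int) -> np.ndarray:
        # The extractor the scientist sends; trivial summary statistics here.
        return np.array([audio.mean(), audio.std(), len(audio) / sample_rate])


    def company_side(extractor: Callable[[np.ndarray, int], np.ndarray],
                     isrcs: List[str],
                     catalog: Dict[str, np.ndarray],
                     sample_rate: int = 44100) -> Dict[str, np.ndarray]:
        # Run the extractor on the company's audio and return one feature
        # vector per requested identifier.
        return {isrc: extractor(catalog[isrc], sample_rate)
                for isrc in isrcs if isrc in catalog}


    # Scientist's side: request features for tracks referenced by ISRC.
    satin_isrcs = ["FRZ039800212", "USUM71703861"]  # illustrative ISRCs only
    catalog = {i: np.random.randn(44100) for i in satin_isrcs}  # stand-in audio
    features = company_side(extract_features, satin_isrcs, catalog)
    for isrc, vector in features.items():
        print(isrc, vector)

In the actual workflow, the scientist would then publish per-track classification results keyed by the same identifiers, so that other algorithms can be compared on exactly the same tracks.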

Keywords

Acoustic signal processing · Classification of instrumentals and songs · Content-based audio retrieval · Database · Machine learning algorithms · Music information retrieval · Music recommendation · Playlist generation · Reproducibility · Signal analysis · Signal processing algorithms · Music autotagging

Notes

Acknowledgements

The authors thank Musixmatch for their metadata and the Research and Development team of Deezer for extracting the audio features. The authors thank Florian Iragne from Simbals for his help with ISRC and musical metadata handling. The authors thank Fidji Berio and Kimberly Malcolm for insightful proofreading.

This work has been partially funded by Charles University (projects GA UK No. 1580317 and SVV 260451), by the internal grant agency of VŠB – Technical University of Ostrava under project no. SP2017/177 “Optimization of machine learning algorithms for the HPC platform”, by the Ministry of Education, Youth and Sports of the Czech Republic from the National Programme of Sustainability (NPU II) project “IT4Innovations excellence in science – LQ1602”, and from the Large Infrastructures for Research, Experimental Development and Innovations project “IT4Innovations National Supercomputing Center – LM2015070”. All findings and points of view expressed in this paper are those of the authors and do not necessarily reflect the views of their academic and industrial partners.

Part of the computer time for this study was provided by the computing facilities MCIA (Mésocentre de Calcul Intensif Aquitain) of the Université de Bordeaux and of the Université de Pau et des Pays de l’Adour.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. LaBRI, UMR 5800, University of Bordeaux, Talence, France
  2. LaBRI, UMR 5800, CNRS, Talence, France
