Machine Learning, Volume 65, Issue 2–3, pp 439–456

Classification-based melody transcription

  • Daniel P. W. Ellis
  • Graham E. Poliner


The melody of a musical piece (informally, the part you would hum along with) is a useful and compact summary of a full audio recording. The extraction of melodic content has practical applications ranging from content-based audio retrieval to the analysis of musical structure. Whereas previous systems generate transcriptions based on a model of the harmonic (or periodic) structure of musical pitches, we present a classification-based system for automatic melody transcription that makes no assumptions beyond what is learned from its training data. We evaluate our algorithm by predicting the melody of the ADC 2004 Melody Competition evaluation set, and we show that a simple frame-level note classifier, temporally smoothed by post-processing with a hidden Markov model, produces results comparable to state-of-the-art model-based transcription systems.
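The temporal smoothing step mentioned above can be illustrated with a minimal sketch of Viterbi decoding over frame-level class scores. This is not the authors' implementation: the function name viterbi_smooth, the stay_prob parameter, the "sticky" transition model (one probability of staying on the same note, the remainder shared equally among the other classes), and the toy scores are all assumptions made for illustration. A real system would more plausibly estimate transition probabilities from training data.

```python
import math


def viterbi_smooth(posteriors, stay_prob=0.9):
    """Return the most likely label sequence for per-frame class scores.

    posteriors: list of frames, each a list of class probabilities
                (hypothetical output of a frame-level classifier).
    stay_prob:  assumed probability of keeping the same label between
                consecutive frames; must lie strictly between 0 and 1.
    Assumes at least two classes.
    """
    n_classes = len(posteriors[0])
    eps = 1e-12  # guards against log(0)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n_classes - 1))

    # A uniform prior adds the same constant to every class, so it is omitted.
    score = [math.log(p + eps) for p in posteriors[0]]
    backpointers = []

    for frame in posteriors[1:]:
        new_score, pointers = [], []
        for j in range(n_classes):
            # Best predecessor for class j under the sticky transition model.
            best_i = max(
                range(n_classes),
                key=lambda i: score[i] + (log_stay if i == j else log_move),
            )
            trans = log_stay if best_i == j else log_move
            new_score.append(score[best_i] + trans + math.log(frame[j] + eps))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the final frame.
    state = max(range(n_classes), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy scores for three note classes over five frames (made-up numbers).
frames = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],  # an isolated frame where class 1 narrowly wins
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
]
raw = [max(range(3), key=lambda k: f[k]) for f in frames]
print(raw)                     # [0, 0, 1, 0, 0]
print(viterbi_smooth(frames))  # [0, 0, 0, 0, 0]
```

In this toy case the per-frame decision flips to class 1 for a single frame, while the smoothed path stays on class 0 because two label changes cost more than the small gain in that frame's score.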


Melody transcription, Audio, Music, Support vector machine, Hidden Markov model, Imbalanced data sets, Multiway classification



Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. LabROSA, Department of Electrical Engineering, Columbia University, New York, USA
