
Evaluation in Music Information Retrieval

Published in: Journal of Intelligent Information Systems

Abstract

The field of Music Information Retrieval has always acknowledged the need for rigorous scientific evaluations, and several efforts have set out to develop and provide the infrastructure, technology and methodologies needed to carry out these evaluations. The community has gained enormously from these evaluation forums, but we have reached a point where the established evaluation frameworks no longer allow us to improve as much and as well as we would like. The community has recently acknowledged this problem and shown interest in addressing it, though it is not clear how to improve the situation. We argue that a good place to start is, once again, the Text IR field. Based on a formalization of the evaluation process, this paper presents a survey of past evaluation work in the context of Text IR, from the point of view of the validity, reliability and efficiency of the experiments. We show the problems our community currently faces in terms of evaluation, point to several lines of research for improving it, and make various proposals along those lines.


Notes

  1. http://ismir2012.ismir.net

  2. http://www.acm.org/about/class/2012

  3. http://mires.eecs.qmul.ac.uk/wiki/index.php/MIR_Challenges

  4. http://trec.nist.gov

  5. http://research.nii.ac.jp/ntcir/

  6. http://www.clef-initiative.eu

  7. http://inex.mmci.uni-saarland.de

  8. http://www.music-ir.org/mirex/wiki/MIREX_HOME

  9. http://www.multimediaeval.org/mediaeval2012/newtasks/music2012/

  10. http://labrosa.ee.columbia.edu/millionsong/challenge

  11. http://www.shazam.com

  12. http://www.jamendo.com

  13. http://archive.org

  14. http://trec.nist.gov/trec_eval/

  15. http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html

  16. http://www.sigkdd.org/kdd2011/kddcup.shtml


Author information


Correspondence to Julián Urbano.

Additional information

M. Schedl is supported by the Austrian Science Fund (FWF): P22856.


About this article

Cite this article

Urbano, J., Schedl, M. & Serra, X. Evaluation in Music Information Retrieval. J Intell Inf Syst 41, 345–369 (2013). https://doi.org/10.1007/s10844-013-0249-4


