Journal of Intelligent Information Systems, Volume 41, Issue 3, pp. 345–369

Evaluation in Music Information Retrieval

Abstract

The field of Music Information Retrieval has always acknowledged the need for rigorous scientific evaluations, and several efforts have set out to develop and provide the infrastructure, technology and methodologies needed to carry out these evaluations. The community has gained enormously from these evaluation forums, but we have reached a point where the current evaluation frameworks do not allow us to improve as much, or as well, as we would like. The community has recently acknowledged this problem and shown interest in addressing it, though it is not clear what should be done to improve the situation. We argue that a good place to start is, once again, the Text IR field. Based on a formalization of the evaluation process, this paper presents a survey of past evaluation work in the context of Text IR, from the point of view of the validity, reliability and efficiency of the experiments. We show the problems that our community currently faces in terms of evaluation, point to several lines of research to improve it, and make various proposals along those lines.

Keywords

Music Information Retrieval · Text Information Retrieval · Evaluation and Experimentation · Survey

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Department of Computer Science, University Carlos III of Madrid, Leganés, Spain
  2. Department of Computational Perception, Johannes Kepler University, Linz, Austria
  3. Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
