A Comparison of Evaluation Metrics for Document Filtering

  • Enrique Amigó
  • Julio Gonzalo
  • Felisa Verdejo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6941)


Although document filtering is simple to define, a wide range of evaluation measures has been proposed in the literature, all of which have been subject to criticism. We present a unified, comparative view of the strengths and weaknesses of the proposed measures, based on two formal constraints (which should be satisfied by any suitable evaluation measure) and various properties (which help differentiate measures according to their behaviour). We conclude that (i) some smoothing process is necessary to satisfy the basic constraints; and (ii) metrics can be grouped into three families, each satisfying one of three formal properties, which are mutually exclusive, i.e. no metric can satisfy all three properties simultaneously.
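To make the remark about smoothing concrete, the following is a minimal sketch, not the formulation used in the paper. It assumes the standard F-measure over precision and recall for the relevant class, and it uses simple additive smoothing of the contingency counts as a stand-in; the function name `f_measure` and the `smoothing` parameter are hypothetical. It shows only that an unsmoothed measure can be undefined when a system discards every document, while a smoothed one still returns a score.

```python
from typing import Optional


def f_measure(tp: int, fp: int, fn: int, smoothing: float = 0.0) -> Optional[float]:
    """Balanced F-measure for the relevant class.

    tp, fp, fn are contingency counts (true positives, false positives,
    false negatives). `smoothing` is an additive constant applied to each
    count; this is an illustrative assumption, not the paper's proposal.
    Returns None when precision or recall is undefined.
    """
    tp_s = tp + smoothing
    fp_s = fp + smoothing
    fn_s = fn + smoothing

    # Precision is undefined if nothing was accepted;
    # recall is undefined if there are no relevant documents.
    if tp_s + fp_s == 0 or tp_s + fn_s == 0:
        return None

    precision = tp_s / (tp_s + fp_s)
    recall = tp_s / (tp_s + fn_s)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A filter that discards every document in a stream with 10 relevant items.
unsmoothed = f_measure(tp=0, fp=0, fn=10)
smoothed = f_measure(tp=0, fp=0, fn=10, smoothing=1.0)
print(unsmoothed, smoothed)
```

Without smoothing the score is undefined for this output, so the system cannot be compared against others; with the assumed additive constant it receives a low but well-defined value.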


Relevant Document, Evaluation Metrics, Input Stream, Concept Drift, Informative System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Enrique Amigó (1)
  • Julio Gonzalo (1)
  • Felisa Verdejo (1)
  1. UNED NLP & IR Group, Madrid, Spain
