European Conference on Machine Learning

ECML 2005: Machine Learning: ECML 2005, pp. 564–575

Counting Positives Accurately Despite Inaccurate Classification

  • George Forman

Part of the Lecture Notes in Computer Science book series (LNAI, volume 3720)

Abstract

Most supervised machine learning research assumes that the training set is a random sample from the target population, and thus that the class distribution is invariant. In real-world situations, however, the class distribution changes, which is known to erode the effectiveness of classifiers and calibrated probability estimators. This paper focuses on the problem of accurately estimating the number of positives in the test set (quantification), as opposed to classifying individual cases accurately. It compares three methods: classify & count, an adjusted variant, and a mixture model. An empirical evaluation on a text classification benchmark reveals that the simple classify & count method is consistently biased, and that the mixture model is surprisingly effective even when positives are very scarce in the training set, a common case in information retrieval.
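
The abstract names the methods without giving formulas, so the following is only a minimal sketch of the first two, not the paper's implementation. It assumes the adjusted variant corrects the observed positive rate using the classifier's true positive rate and false positive rate (for example, estimated by cross-validation on the training set). The function names `classify_and_count` and `adjusted_count` and the toy numbers are illustrative assumptions.

```python
def classify_and_count(predictions):
    """Estimate the positive fraction as the share of cases predicted positive.

    `predictions` is a sequence of 0/1 classifier outputs on the test set.
    """
    if len(predictions) == 0:
        raise ValueError("predictions must not be empty")
    return sum(predictions) / len(predictions)


def adjusted_count(predictions, tpr, fpr):
    """Correct the classify & count estimate for known classifier error rates.

    Assumption (not stated in the abstract): the observed positive rate equals
    tpr * p + fpr * (1 - p), where p is the true positive fraction, so
    p = (observed - fpr) / (tpr - fpr). The result is clipped to [0, 1].
    """
    if tpr <= fpr:
        raise ValueError("tpr must exceed fpr for the adjustment to be defined")
    observed = classify_and_count(predictions)
    estimate = (observed - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


# Toy illustration (made-up numbers): 1000 test cases, 200 of them truly
# positive, scored by a classifier with tpr = 0.8 and fpr = 0.1. It flags
# 0.8 * 200 + 0.1 * 800 = 240 cases as positive.
toy_predictions = [1] * 240 + [0] * 760
raw_estimate = classify_and_count(toy_predictions)
corrected_estimate = adjusted_count(toy_predictions, tpr=0.8, fpr=0.1)
```

In this toy case the raw estimate overstates the positive fraction (0.24 against a true 0.20), which illustrates the kind of bias the abstract attributes to the simple method, while the adjusted estimate recovers the true fraction up to floating-point rounding. The mixture model is not sketched here because the abstract gives no detail about it.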

Keywords

  • Support Vector Machine
  • Feature Selection
  • Mixture Model
  • Class Distribution
  • Binary Classifier

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Author information

Authors and Affiliations

  1. Hewlett-Packard Labs, Palo Alto, CA, 94304, USA

    George Forman

Editor information

Editors and Affiliations

  1. Faculty of Economics of the University of Porto, Portugal

    João Gama

  2. Faculdade de Engenharia & LIAAD, Universidade do Porto, Portugal

    Rui Camacho

  3. LIAAD-INESC Porto L.A./Faculty of Economics, University of Porto, Rua de Ceuta, 118-6, 4050-190, Porto, Portugal

    Pavel B. Brazdil

  4. LIACC/FEP, Universidade do Porto, Portugal

    Alípio Mário Jorge

  5. LIAAD-INESC Porto LA / FEP, University of Porto, R. de Ceuta, 118, 6., 4050-190, Porto, Portugal

    Luís Torgo

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Forman, G. (2005). Counting Positives Accurately Despite Inaccurate Classification. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds) Machine Learning: ECML 2005. ECML 2005. Lecture Notes in Computer Science, vol 3720. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564096_55

  • DOI: https://doi.org/10.1007/11564096_55

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29243-2

  • Online ISBN: 978-3-540-31692-3

  • eBook Packages: Computer Science, Computer Science (R0)
