Improving Generalization by Data Categorization

Li, Ling; Pratap, Amrit; Lin, Hsuan-Tien; Abu-Mostafa, Yaser S.

doi:10.1007/11564126_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3721)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

In most learning algorithms, all examples in the training set are treated equally. Some examples, however, carry more reliable or critical information about the target than others, and some may carry wrong information. According to their intrinsic margin, examples can be grouped into three categories: typical, critical, and noisy. We propose three methods, namely the selection cost, the SVM confidence margin, and the AdaBoost data weight, to automatically group training examples into these three categories. Experimental results on artificial datasets show that, although the three methods are quite different in nature, they give similar and reasonable categorizations. Results with real-world datasets further demonstrate that treating the three data categories differently in learning can improve generalization.
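
To make the idea of an AdaBoost data weight concrete, the following is a minimal sketch and not the authors' procedure. It assumes a one-dimensional toy dataset with one deliberately flipped label, decision stumps as the base learner, and a score defined as the weight of each example averaged over boosting rounds. The function names and the two cut-off factors in `categorize` are hypothetical choices made for illustration; the abstract does not state how the paper sets its thresholds.

```python
import math

def stump_predict(x, threshold, sign):
    """Decision stump: returns sign if x > threshold, otherwise -sign."""
    return sign if x > threshold else -sign

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, sign) of the lowest-error stump."""
    points = sorted(set(xs))
    # One threshold below all points, plus midpoints between neighbours.
    thresholds = [points[0] - 1.0] + [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = None
    for t in thresholds:
        for s in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, s) != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

def average_adaboost_weights(xs, ys, rounds):
    """Run AdaBoost and return each example's weight averaged over rounds."""
    n = len(xs)
    weights = [1.0 / n] * n
    totals = [0.0] * n
    used = 0
    for _ in range(rounds):
        # Record the distribution the current base learner is trained on.
        totals = [a + w for a, w in zip(totals, weights)]
        used += 1
        err, t, s = best_stump(xs, ys, weights)
        if err <= 0.0 or err >= 0.5:
            break  # perfect or useless base learner: stop boosting
        alpha = 0.5 * math.log((1.0 - err) / err)
        # Standard AdaBoost update: raise weights of misclassified examples.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, s))
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return [a / used for a in totals]

def categorize(avg_weights, noisy_factor=2.0, critical_factor=1.0):
    """Hypothetical rule: compare each average weight with the uniform weight 1/n."""
    n = len(avg_weights)
    labels = []
    for w in avg_weights:
        if w > noisy_factor / n:
            labels.append("noisy")
        elif w > critical_factor / n:
            labels.append("critical")
        else:
            labels.append("typical")
    return labels

# Toy data: label is -1 for x < 5 and +1 otherwise, except x = 1.0, which is flipped.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [-1, +1, -1, -1, -1, +1, +1, +1, +1, +1]

avg = average_adaboost_weights(xs, ys, rounds=3)
categories = categorize(avg)
for x, y, w, c in zip(xs, ys, avg, categories):
    print(f"x={x:.1f} label={y:+d} avg_weight={w:.3f} category={c}")
```

On this toy set the mislabeled point at x = 1.0 receives the largest average weight, because the boosting rounds keep failing to fit it. The remaining labels depend heavily on the arbitrary cut-offs and the small number of rounds, so only the ranking should be read as meaningful here.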

Author information

Authors and Affiliations

Learning Systems Group, California Institute of Technology, USA
Ling Li, Amrit Pratap, Hsuan-Tien Lin & Yaser S. Abu-Mostafa

Editor information

Editors and Affiliations

LIACC/FEP, Universidade do Porto, Portugal
Alípio Mário Jorge
LIAAD-INESC Porto LA / FEP, University of Porto, R. de Ceuta, 118, 6, 4050-190, Porto, Portugal
Luís Torgo
LIAAD-INESC Porto L.A./Faculty of Economics, University of Porto, Rua de Ceuta, 118-6, 4050-190, Porto, Portugal
Pavel Brazdil
Faculdade de Engenharia & LIAAD, Universidade do Porto, Portugal
Rui Camacho
Faculty of Economics of the University of Porto, Portugal
João Gama

About this paper

Cite this paper

Li, L., Pratap, A., Lin, H.-T., Abu-Mostafa, Y.S. (2005). Improving Generalization by Data Categorization. In: Jorge, A.M., Torgo, L., Brazdil, P., Camacho, R., Gama, J. (eds) Knowledge Discovery in Databases: PKDD 2005. Lecture Notes in Computer Science, vol 3721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564126_19

DOI: https://doi.org/10.1007/11564126_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29244-9
Online ISBN: 978-3-540-31665-7
eBook Packages: Computer Science, Computer Science (R0)
