Text Categorization Based on Regularized Linear Classification Methods

Zhang, Tong; Oles, Frank J.

doi:10.1023/A:1011441423217

Published: April 2001

Information Retrieval, Volume 4, pages 5–31 (2001)

Tong Zhang¹ &
Frank J. Oles¹

Abstract

A number of linear classification methods, such as the linear least squares fit (LLSF), logistic regression, and support vector machines (SVMs), have been applied to text categorization problems. These methods are similar in that they find hyperplanes that approximately separate a class of document vectors from its complement. However, support vector machines have so far been considered special in that they have been demonstrated to achieve state-of-the-art performance. It is therefore worthwhile to understand whether such good performance is unique to the SVM design or whether it can also be achieved by other linear classification methods. In this paper, we compare a number of known linear classification methods, as well as some variants, in the framework of regularized linear systems. We discuss the statistical and numerical properties of these algorithms, with a focus on text categorization. We also provide numerical experiments to illustrate these algorithms on a number of datasets.
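
The shared structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each method fits a weight vector w by minimizing an average loss over the margins y_i (w · x_i) plus a ridge penalty (lam / 2) ||w||², and that the methods differ only in the loss. The toy term-count matrix, the vocabulary, the values of `lam`, `step`, and `iters`, and the function names are all hypothetical choices made for illustration.

```python
import numpy as np

def loss_derivative(margin, loss):
    """Derivative of the per-example loss with respect to the margin y * (w . x)."""
    if loss == "least_squares":   # loss (1 - m)^2
        return -2.0 * (1.0 - margin)
    if loss == "logistic":        # loss log(1 + exp(-m))
        return -1.0 / (1.0 + np.exp(margin))
    if loss == "hinge":           # loss max(0, 1 - m); a subgradient is used
        return np.where(margin < 1.0, -1.0, 0.0)
    raise ValueError(f"unknown loss: {loss}")

def fit_linear(X, y, loss, lam=0.01, step=0.05, iters=5000):
    """Plain (sub)gradient descent on mean loss + (lam / 2) * ||w||^2.

    lam, step, and iters are illustrative assumptions, not tuned values.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margin = y * (X @ w)
        g = loss_derivative(margin, loss)                    # shape (n,)
        grad = (X * (g * y)[:, None]).mean(axis=0) + lam * w
        w -= step * grad
    return w

def predict(X, w):
    """Label a document +1 if it falls on the positive side of the hyperplane."""
    return np.where(X @ w >= 0.0, 1, -1)

# Hypothetical toy corpus: rows are documents, the first four columns are
# counts of the terms ["ball", "game", "vote", "law"], and the last column
# is a constant 1 that acts as the bias term.
X = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 0, 1],
    [3, 0, 1, 0, 1],
    [0, 0, 2, 1, 1],
    [0, 1, 1, 2, 1],
    [0, 0, 1, 3, 1],
], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])   # +1 = in the category, -1 = complement

for loss in ("least_squares", "logistic", "hinge"):
    w = fit_linear(X, y, loss)
    print(loss, np.round(w, 2), predict(X, w))
```

Switching the `loss` argument is the only change needed to move between the three classifiers in this sketch, which is one way to read the "regularized linear systems" framing. A real text categorization experiment would use sparse document vectors and a dedicated solver rather than dense gradient descent.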

Author information

Authors and Affiliations

Mathematical Sciences Department, IBM T.J. Watson Research Center, Yorktown Heights, NY, 10598
Tong Zhang & Frank J. Oles

About this article

Cite this article

Zhang, T., Oles, F.J. Text Categorization Based on Regularized Linear Classification Methods. Information Retrieval 4, 5–31 (2001). https://doi.org/10.1023/A:1011441423217

Issue Date: April 2001
DOI: https://doi.org/10.1023/A:1011441423217
