Single pass text classification by direct feature weighting

Malik, Hassan H.; Fradkin, Dmitriy; Moerchen, Fabian

doi:10.1007/s10115-010-0317-9

Regular Paper
Published: 25 June 2010

Knowledge and Information Systems, volume 28, pages 79–98 (2011)

Hassan H. Malik¹,
Dmitriy Fradkin² &
Fabian Moerchen²

Abstract

The Feature Weighting Classifier (FWC) is an efficient multi-class classification algorithm for text data that uses Information Gain to directly estimate per-class feature weights in the classifier. The classifier requires only a single pass over the dataset to compute the feature frequencies per class, is easy to implement, and has memory usage that is linear in the number of features. Results of experiments performed on 128 binary and multi-class text and web datasets show that FWC’s performance is at least comparable to, and often better than, that of Naive Bayes, TWCNB, Winnow, Balanced Winnow, and linear SVM. On a large-scale web dataset with 12,294 classes and 135,973 training instances, FWC trained in 13 s and yielded classification performance comparable to that of a state-of-the-art multi-class SVM implementation, which took over 15 min to train.
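The abstract does not give the exact weighting formula, so the following Python sketch is only an illustration of the general idea, not the authors' FWC. It assumes binary (present/absent) features, a one-vs-rest Information Gain per (class, feature) pair, a sign that is positive when the feature is over-represented in the class, and prediction by summing the weights of the features present. The function names (`count_pass`, `info_gain_weights`, `predict`) and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(pos, neg):
    """Binary entropy, in bits, of a split into pos and neg counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h


def count_pass(docs):
    """Single pass over (tokens, label) pairs, collecting document counts."""
    class_docs = Counter()             # label -> number of documents
    feat_class = defaultdict(Counter)  # feature -> label -> docs containing it
    for tokens, label in docs:
        class_docs[label] += 1
        for feat in set(tokens):       # presence only, not term frequency
            feat_class[feat][label] += 1
    return class_docs, feat_class


def info_gain_weights(class_docs, feat_class):
    """Signed one-vs-rest Information Gain per (class, feature); an assumption."""
    n = sum(class_docs.values())
    weights = defaultdict(dict)
    for feat, per_class in feat_class.items():
        df = sum(per_class.values())   # documents containing the feature
        for label, n_c in class_docs.items():
            a = per_class.get(label, 0)  # in class, feature present
            b = df - a                   # other classes, feature present
            c = n_c - a                  # in class, feature absent
            d = n - df - c               # other classes, feature absent
            h_prior = entropy(n_c, n - n_c)
            h_cond = (df / n) * entropy(a, b) + ((n - df) / n) * entropy(c, d)
            # Positive weight if the feature is over-represented in the class.
            sign = 1.0 if a * n >= df * n_c else -1.0
            weights[label][feat] = sign * (h_prior - h_cond)
    return weights


def predict(weights, tokens):
    """Return the class whose summed feature weights are largest."""
    present = set(tokens)
    scores = {label: sum(w.get(f, 0.0) for f in present)
              for label, w in weights.items()}
    return max(scores, key=scores.get)


# Toy data, invented purely for illustration.
docs = [
    (["goal", "match", "team"], "sports"),
    (["team", "score", "goal"], "sports"),
    (["stock", "market", "price"], "finance"),
    (["market", "trade", "price"], "finance"),
]
class_docs, feat_class = count_pass(docs)
weights = info_gain_weights(class_docs, feat_class)
print(predict(weights, ["goal", "team"]))
print(predict(weights, ["market", "price"]))
```

Only `count_pass` touches the training data, which mirrors the single-pass property described in the abstract; the weights are then derived from the stored counts alone.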

Author information

Authors and Affiliations

Thomson Reuters, 195 Broadway, New York, NY, 10007, USA
Hassan H. Malik
Integrated Data Systems, Siemens Corporate Research, 755 College Rd. East, Princeton, NJ, 08540, USA
Dmitriy Fradkin & Fabian Moerchen

Corresponding author

Correspondence to Hassan H. Malik.

About this article

Cite this article

Malik, H.H., Fradkin, D. & Moerchen, F. Single pass text classification by direct feature weighting. Knowl Inf Syst 28, 79–98 (2011). https://doi.org/10.1007/s10115-010-0317-9

Received: 04 February 2010
Revised: 03 May 2010
Accepted: 11 June 2010
Published: 25 June 2010
Issue Date: July 2011
DOI: https://doi.org/10.1007/s10115-010-0317-9
