Abstract
Naïve Bayes (NB), kNN, and AdaBoost are three commonly used text classifiers. Evaluating these classifiers involves a variety of factors, including the benchmark used, feature selection, the parameter settings of the algorithms, and the measurement criteria employed. Researchers have demonstrated that some algorithms outperform others on some corpora; however, labelling and corpus bias are two concerns in text categorization. This paper evaluates the three classifiers on an automatically generated text document set that is labelled by a group of experts to alleviate the subjectiveness of labelling, and at the same time examines how the performance of the algorithms is influenced by the feature selection algorithm and the number of features selected.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhu, D., Wong, K.W. (2014). Text Categorization Using an Automatically Generated Labelled Dataset: An Evaluation Study. In: Loo, C.K., Yap, K.S., Wong, K.W., Teoh, A., Huang, K. (eds) Neural Information Processing. ICONIP 2014. Lecture Notes in Computer Science, vol 8834. Springer, Cham. https://doi.org/10.1007/978-3-319-12637-1_60
DOI: https://doi.org/10.1007/978-3-319-12637-1_60
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12636-4
Online ISBN: 978-3-319-12637-1
eBook Packages: Computer Science (R0)