Abstract
This research presents an improved version of the table-based matching algorithm as an approach to text categorization. It is intended to address two issues: the three problems that arise when texts are encoded into numerical vectors, and the unstable performance of the previous version, which fluctuated with text length. In this research, we encode texts into tables rather than numerical vectors, define a similarity measure between two tables that always yields a normalized value between zero and one, and apply it to text categorization tasks. We expect two benefits: better performance, from avoiding the three problems of encoding texts into numerical vectors, and more stable performance, from improving on the previous version. We therefore validate the proposed approach empirically through four sets of experiments, with respect to both performance and stability.
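The general idea of table matching with a normalized score can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: a table is taken to be a mapping from words to raw frequencies, the similarity is the weight of the shared words divided by the total weight of both tables, and the names `encode_table`, `table_similarity`, and `categorize` are hypothetical.

```python
from collections import Counter


def encode_table(text):
    """Encode a text as a table of word -> frequency.

    Assumption: raw term frequency is the weight; the paper may
    use a different weighting scheme.
    """
    return Counter(text.lower().split())


def table_similarity(a, b):
    """Return a similarity in [0, 1] between two tables.

    Assumption: shared weight divided by total weight. Because the
    shared weight can never exceed the total, the score stays in
    [0, 1] regardless of how long either text is.
    """
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(a[w] + b[w] for w in a.keys() & b.keys())
    return shared / total


def categorize(text, category_tables):
    """Assign the text to the category whose table is most similar."""
    table = encode_table(text)
    return max(
        category_tables,
        key=lambda c: table_similarity(table, category_tables[c]),
    )


# Hypothetical category tables built from a few sample words.
category_tables = {
    "sports": encode_table("ball team game score"),
    "finance": encode_table("stock market bank price"),
}
label = categorize("the team won the game", category_tables)
```

Dividing by the combined weight of both tables is one simple way to keep the score bounded, which is the property the abstract attributes to the normalized measure; the measure actually used in the paper may differ.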
Additional information
Communicated by J.-W. Jung.
Cite this article
Jo, T. Normalized table-matching algorithm as approach to text categorization. Soft Comput 19, 839–849 (2015). https://doi.org/10.1007/s00500-014-1411-9
DOI: https://doi.org/10.1007/s00500-014-1411-9