Features Selection Method for Automatic Text Categorization: A Comparative Study with WEKA and RapidMiner Tools
The advent of the Internet over the past few decades has revolutionised science and technology. The enormous growth of data on the Internet has created a need for effective representation of textual information. Organizers of technical conferences and journals must assign research papers to session tracks, a task that consumes considerable time. This investigation offers a solution to this problem through automatic document categorization supported by a feature selection method. Researchers and students face a related problem: it is nearly impossible to read most newly published papers and stay informed of the latest progress, and the time spent surveying the literature seems endless. The goal of this research is to design a domain-independent automatic text categorization system to alleviate, if not entirely solve, this problem. Text categorization is the task of assigning predefined categories to natural language text. This paper explores a word's frequency and other properties that express its role in a document. The proposed features are tf-idf, the position of the word, and compactness, and these features are combined. Experiments show that the feature selection method is effective for text categorization. The proposed approach is validated with Naïve Bayes, decision tree induction, nearest-neighbour, and SVM classifiers. The experimental results show comparatively good accuracy (above 95%), precision, and recall, indicating that the system is effective and efficient, and that text categorization improves significantly when these features are combined.
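The term-weighting feature the abstract lists alongside word position and compactness can be illustrated with a minimal, standard-library-only sketch of plain (unsmoothed) tf-idf. The corpus, function name, and tokenization below are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute plain tf-idf weights for each term in each tokenized document.

    tf  = term count / document length
    idf = log(N / document frequency)   # no smoothing (illustrative variant)
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy two-document corpus (hypothetical example data).
corpus = [
    "text categorization assigns categories to text".split(),
    "feature selection improves categorization accuracy".split(),
]
w = tf_idf(corpus)
```

With this unsmoothed variant, a term occurring in every document (here "categorization") receives weight zero, while document-specific terms get positive weight; real systems typically add smoothing to avoid discarding such terms entirely.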