Features Selection Method for Automatic Text Categorization: A Comparative Study with WEKA and RapidMiner Tools

  • Conference paper

Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 249)

Abstract

The rise of the Internet over the past few decades has transformed science and technology. The enormous growth of online data has created a need for effective representation of textual information. Organizers of technical conferences and journals must spend considerable time assigning research papers to session tracks. This investigation addresses that problem with an automatic document categorization approach based on a feature selection method. Researchers and students face a related problem: it is almost impossible to read most newly published papers and stay informed of the latest progress, and the time spent reviewing the literature seems endless. The goal of this research is to design a domain-independent automatic text categorization system that alleviates, if not entirely solves, this problem. Text categorization is the task of assigning predefined categories to natural language text. This paper explores the effect of a word's occurrence together with other values that express the features of that word in the document. The proposed features are tf-itf, the position of the word, and compactness, and these features are combined. Experiments show that the feature selection method is effective for text categorization. The proposed approach is validated with Naïve Bayesian, Decision Tree Induction, Nearest Neighbour, and SVM classifiers. The experimental results show comparatively good accuracy (above 95%), precision, and recall, indicating that the system is effective and efficient and that combining these features significantly improves text categorization.
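The abstract names word position and compactness as per-word features but does not define them. As a rough illustration only, the Python sketch below computes a term frequency, a first-appearance position, and two simple compactness measures for each word of a tokenized document. The function name `word_features`, the `num_parts` parameter, and the exact definitions (normalized first position, first-to-last span, number of equal-sized document parts containing the word) are assumptions made for this example, not the formulas used in the paper, and the tf-itf weighting is not reproduced here.

```python
from collections import defaultdict


def word_features(tokens, num_parts=4):
    """Return per-word features for one tokenized document.

    All definitions below are illustrative assumptions:
      tf        - occurrences of the word divided by document length
      first_pos - index of the first occurrence, normalized to [0, 1)
      span      - distance between first and last occurrence, normalized
      parts     - how many of `num_parts` equal slices contain the word
    """
    n = len(tokens)
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    feats = {}
    for word, pos in positions.items():
        # Map each position to the slice of the document it falls in.
        parts = {p * num_parts // n for p in pos}
        feats[word] = {
            "tf": len(pos) / n,
            "first_pos": pos[0] / n,
            "span": (pos[-1] - pos[0]) / n,
            "parts": len(parts),
        }
    return feats


if __name__ == "__main__":
    doc = "text categorization assigns categories to text".split()
    for word, f in word_features(doc, num_parts=2).items():
        print(word, f)
```

A word with a small `span` and a low `parts` count is concentrated in one region of the document, while a word that appears in many parts is spread throughout it. Features of this kind could be placed in a feature vector alongside a frequency-based weight before training a classifier.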

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Manne, S., Muddana, S., Sohail, A., Fatima, S. (2014). Features Selection Method for Automatic Text Categorization: A Comparative Study with WEKA and RapidMiner Tools. In: Satapathy, S., Avadhani, P., Udgata, S., Lakshminarayana, S. (eds) ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of Computer Society of India- Vol II. Advances in Intelligent Systems and Computing, vol 249. Springer, Cham. https://doi.org/10.1007/978-3-319-03095-1_13

  • DOI: https://doi.org/10.1007/978-3-319-03095-1_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-03094-4

  • Online ISBN: 978-3-319-03095-1

  • eBook Packages: Engineering, Engineering (R0)