Semantic Based Category-Keywords List Enrichment for Document Classification

  • Upasana Pandey
  • S. Chakraverty
  • Richa Mihani
  • Ruchika Arya
  • Sonali Rathee
  • Richa K. Sharma
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 166)

Abstract

We present a text categorization technique that extracts semantic features from documents to generate a compact set of keywords and uses those keywords to classify the text. The algorithm reduces the dimensionality of the document representation by exploiting overlapping semantics. A keyword-category relationship matrix is then used to compute the degree of membership of each document in the predefined input categories, and the document's category is derived from these membership metrics. Wikipedia is also used to enrich the category keyword lists. The proposed work points to a new direction for document classification in web applications.
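
As a rough illustration of the membership step described in the abstract, the following Python sketch assumes a binary keyword-category matrix and a simple fraction-of-matches score. The function names, the scoring rule, and the sample category lists are hypothetical and are not taken from the paper, which may define its membership metrics differently.

```python
from typing import Dict, List, Set


def build_relationship_matrix(
    keywords: List[str], category_lists: Dict[str, Set[str]]
) -> Dict[str, Dict[str, int]]:
    """Binary keyword-category matrix (assumed form):
    entry is 1 if the keyword appears in the category's keyword list."""
    return {
        kw: {cat: int(kw in terms) for cat, terms in category_lists.items()}
        for kw in keywords
    }


def membership(
    keywords: List[str], category_lists: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Membership of a document in each category, taken here (as an
    assumption) to be the fraction of its keywords found in that category."""
    if not keywords:
        return {cat: 0.0 for cat in category_lists}
    matrix = build_relationship_matrix(keywords, category_lists)
    return {
        cat: sum(matrix[kw][cat] for kw in keywords) / len(keywords)
        for cat in category_lists
    }


def classify(keywords: List[str], category_lists: Dict[str, Set[str]]) -> str:
    """Assign the category with the highest membership score."""
    scores = membership(keywords, category_lists)
    return max(scores, key=scores.get)


# Hypothetical category keyword lists and document keywords.
category_lists = {
    "sports": {"match", "team", "goal"},
    "finance": {"bank", "loan", "market"},
}
doc_keywords = ["team", "goal", "market", "coach"]

scores = membership(doc_keywords, category_lists)
print(scores)                                  # {'sports': 0.5, 'finance': 0.25}
print(classify(doc_keywords, category_lists))  # sports
```

In this sketch, enriching a category list (for example with related terms drawn from Wikipedia) would amount to adding entries to the corresponding set, which raises the chance that a document keyword matches.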

Keywords

  • Overlapping Semantics
  • Lexical Chaining
  • Membership metrics

Copyright information

© Springer-Verlag GmbH Berlin Heidelberg 2012

Authors and Affiliations

  • Upasana Pandey (1)
  • S. Chakraverty (1)
  • Richa Mihani (1)
  • Ruchika Arya (1)
  • Sonali Rathee (1)
  • Richa K. Sharma (1)

  1. Netaji Subhas Institute of Technology (NSIT), New Delhi, India
