An Integrated Approach to Improve the Text Categorization Using Semantic Measures

Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 32)

Abstract

Categorization of text documents plays a vital role in information retrieval systems. Clustering text documents in a way that supports effective classification and the extraction of semantic knowledge is a tedious task. Most existing methods cluster documents based on factors such as term frequency, document frequency, and feature selection, but their clustering accuracy is still not up to the mark. In this paper, we propose an integrated approach built on a metric named Term Rank Identifier (TRI). TRI measures the frequent terms and indexes them by frequency. For these ranked terms, TRI finds the semantics and the corresponding class labels. We also propose a Semantically Enriched Terms Clustering (SETC) algorithm; integrated with TRI, it improves clustering accuracy, which leads to incremental text categorization. Our experimental analysis on different data sets shows that the proposed SETC performs better.
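
The abstract says only that TRI measures frequent terms and indexes them by frequency; it does not give the actual procedure. As a rough illustration of that single step, the following sketch counts terms across a small corpus and assigns ranks by descending frequency. The function name rank_terms, the min_count threshold, the regex tokenizer, and the sample documents are all assumptions made for this example and are not taken from the paper. The semantic lookup and class labeling that TRI also performs are not shown.

```python
import re
from collections import Counter


def rank_terms(documents, min_count=2):
    """Rank terms by corpus frequency (hypothetical helper, not the paper's TRI).

    Returns a list of (rank, term, count) tuples for every term that occurs
    at least `min_count` times across all documents.
    """
    counts = Counter()
    for doc in documents:
        # Assumed tokenizer: lowercase alphabetic runs only.
        counts.update(re.findall(r"[a-z]+", doc.lower()))

    # Keep only the frequent terms.
    frequent = [(term, count) for term, count in counts.items() if count >= min_count]

    # Sort by descending count; break ties alphabetically so ranks are deterministic.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))

    return [(rank, term, count) for rank, (term, count) in enumerate(frequent, start=1)]


# Illustrative corpus (made up for this sketch).
docs = [
    "text clustering improves text retrieval",
    "semantic clustering of text documents",
]
ranked = rank_terms(docs)
for rank, term, count in ranked:
    print(rank, term, count)
```

A real pipeline would typically add stop-word removal and stemming before counting; those steps are omitted here to keep the sketch minimal.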

Keywords

Text categorization, Clustering, Semantic knowledge, Term rank identifier, Semantically enriched terms clustering

Copyright information

© Springer India 2015

Authors and Affiliations

  1. Department of CSE, JNTU College of Engineering, Kakinada, India
