Use of a Weighted Topic Hierarchy for Document Classification

  • Alexander Gelbukh
  • Grigori Sidorov
  • Adolfo Guzmán-Arenas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1692)


A statistical method of document classification driven by a hierarchical topic dictionary is proposed. The method uses a dictionary with a simple structure and is insensitive to inaccuracies in the dictionary. Two kinds of weights of dictionary entries are discussed: relevance weights and discrimination weights. Weights of the first kind are associated with the links between words and topics and between the nodes in the tree, while weights of the second kind depend on the user database. A way of assigning these weights to the topics that complies with common sense is presented. Classifier, a text classification system based on the discussed method, is described.
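The abstract does not give the formulas, so the following Python sketch is only one plausible reading of the idea, not the authors' implementation. It assumes that a word's contribution to a topic is its count times a relevance weight, that the contribution is passed up the tree scaled by the weight of each parent link, and that each topic's total is finally multiplied by a discrimination weight. The toy hierarchy, all weight values, and the names `PARENT`, `WORD_TOPICS`, `DISCRIMINATION`, `topic_scores`, and `classify` are hypothetical.

```python
from collections import Counter

# Hypothetical topic tree: child topic -> (parent topic, relevance weight of the link).
PARENT = {
    "soccer": ("sports", 0.9),
    "tennis": ("sports", 0.9),
    "sports": ("ROOT", 0.5),
    "banking": ("economy", 0.8),
    "economy": ("ROOT", 0.5),
}

# Hypothetical dictionary: word -> list of (topic, relevance weight of the link).
WORD_TOPICS = {
    "goal": [("soccer", 0.7)],
    "racket": [("tennis", 1.0)],
    "match": [("soccer", 0.5), ("tennis", 0.5)],
    "loan": [("banking", 1.0)],
}

# Hypothetical discrimination weights. In the paper these depend on the user
# database; here they are fixed by hand (an assumption), with the root set to 0
# because a topic covering every document cannot tell documents apart.
DISCRIMINATION = {
    "soccer": 1.0, "tennis": 1.0, "sports": 0.6,
    "banking": 1.0, "economy": 0.6, "ROOT": 0.0,
}


def topic_scores(words):
    """Return topic -> score for a document given as a list of words."""
    raw = Counter()
    for word, count in Counter(words).items():
        # Words missing from the dictionary are simply ignored.
        for topic, relevance in WORD_TOPICS.get(word, []):
            node, contribution = topic, count * relevance
            while True:
                raw[node] += contribution
                link = PARENT.get(node)
                if link is None:  # reached the top of the tree
                    break
                node, link_weight = link
                contribution *= link_weight
    # Scale each accumulated score by the topic's discrimination weight.
    return {t: s * DISCRIMINATION.get(t, 1.0) for t, s in raw.items()}


def classify(words):
    """Return the highest-scoring topic, or None if no dictionary word occurs."""
    scores = topic_scores(words)
    return max(scores, key=scores.get, default=None)
```

Because unknown words are skipped and scores accumulate over many words, a few wrong or missing dictionary links shift the totals only slightly, which is one way to read the claimed insensitivity to dictionary inaccuracies.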


Semantic Network, Dictionary Entry, Topic Detection, Relevance Weight, Discrimination Weight
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Alexander Gelbukh (1)
  • Grigori Sidorov (1)
  • Adolfo Guzmán-Arenas (1)
  1. Natural Language Laboratory, Center for Computing Research (CIC), National Polytechnic Institute (IPN), Zacatenco, Mexico City, Mexico
