A Competitive Term Selection Method for Information Retrieval

  • Franco Rojas López
  • Héctor Jiménez-Salazar
  • David Pinto
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4394)


Term selection process is a very necessary component for most natural language processing tasks. Although different unsupervised techniques have been proposed, the best results are obtained with a high computational cost, for instance, those based on the use of entropy. The aim of this paper is to propose an unsupervised term selection technique based on the use of a bigram-enriched version of the transition point. Our approach reduces the corpus vocabulary size by using the transition point technique and, thereafter, it expands the reduced corpus with bigrams obtained from the same corpus, i.e., without external knowledge sources. This approach provides a considerable dimensionality reduction of the TREC-5 collection and, also has shown to improve precision for some entropy-based methods.


Information Retrieval Term Selection Vector Space Model Important Term Vocabulary Size 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Baeza-Yates, R., Ribeiro, N.: Modern Information Retrieval. Addison-Wesley, Reading (1999)Google Scholar
  2. 2.
    Booth, A.: A law of occurrence of words of low frequency. Information and Control 10(4), 383–396 (1967)CrossRefGoogle Scholar
  3. 3.
    Shannon, C.E.: The Bell System Technical Journal 27, 379 (1948)Google Scholar
  4. 4.
    Gelbukh, A., Sidorov, G., Guzman-Arenas, A.: Use of a weighted topic hierarchy for text retrieval and classification. In: Matoušek, V., Mautner, P., Ocelíková, J., Sojka, P. (eds.) TSD 1999. LNCS (LNAI), vol. 1692, pp. 130–135. Springer, Heidelberg (1999)Google Scholar
  5. 5.
    Jiménez-Salazar, H., Castro, M., Rojas, F., Miñón, E., Pinto, D., Carcedo, F.: Unsupervised Term Selection using Entropy. In: Research on Computing Science 14, México, pp. 163–172 (2005)Google Scholar
  6. 6.
    Montemurro, M.A., Zanette, D.H.: Entropic Analysis of the role of the words in literaty texts, CoRR, arXiv:cond-mat/0109218, v1 12 (Sept. 2001)Google Scholar
  7. 7.
    Moyotl, E.: DPT: un método de selección de términos para categorización de textos, Master in Computer Science Thesis, FCC-BUAP (In spanish) (2005)Google Scholar
  8. 8.
    Moyotl, E., Jiménez, H.: An Analysis on Frequency of Terms for Text Categorization. In: Procesamiento del Lenguaje Natural, España, pp. 141–146.Google Scholar
  9. 9.
    Moyotl, E., Jiménez, H.: Enhancement of DPT Feature Selection Method for Text Categorization. In: Gelbukh, A. (ed.) CICLing 2005. LNCS, vol. 3406, pp. 706–709. Springer, Heidelberg (2005)Google Scholar
  10. 10.
    Pérez-Carballo, J., Strzalkowski, T.: Natural Language Information Retrieval: progress report. Information Processing and Management 36(1), 155–178 (2000)CrossRefGoogle Scholar
  11. 11.
    Pinto, D., Jiménez-Salazar, H., Rosso, P., Sanchis, E.: BUAP-UPV TPIRS: A System for Document Indexing Reduction at WebCLEF. In: Peters, C., Gey, F.C., Gonzalo, J., Müller, H., Jones, G.J.F., Kluck, M., Magnini, B., de Rijke, M., Giampiccolo, D. (eds.) CLEF 2005. LNCS, vol. 4022, Springer, Heidelberg (2006)CrossRefGoogle Scholar
  12. 12.
    Pinto, D., Jiménez-Salazar, H.: Paolo Rosso: Clustering Abstracts of Scientific Texts using the Transition Point Technique. In: Gelbukh, A. (ed.) CICLing 2006. LNCS, vol. 3878, pp. 536–546. Springer, Heidelberg (2006)CrossRefGoogle Scholar
  13. 13.
    Rojas, F., Jiménez, H., Pinto, D., López, A.: Dimensionality reduction for Information Retrieval. Research on Computing Science 20, 107–112 (2006)Google Scholar
  14. 14.
    Rojas, F., Jiménez, H., Pinto, D.: Text Reduction-Enrichment at WebCLEF. In: Proceedings of CLEF 2006, p. 53 (2006)Google Scholar
  15. 15.
    Salton, G., Wong, A., Yang, C.: A Vector Space Model for Automatic Indexing. Communications of the ACM 18(11), 613–620 (1975)CrossRefzbMATHGoogle Scholar
  16. 16.
    Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1), 1–47 (2002)CrossRefGoogle Scholar
  17. 17.
    Urbizagástegui, A.R.: Las Posibilidades de la Ley de Zipf en la Indización Automática (In spanish) (1999),
  18. 18.
    Yang, Y., Pedersen, P.: A Comparative Study on Feature Selection in Text Categorization. In: Proc. of ICML-97, 14th Int. Conf. on Machine Learning, pp. 412–420 (1997)Google Scholar
  19. 19.
    Zipf, G.K.: Human Behaviour and the Principle of Least Effort. Addison-Wesley, Reading (1949)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Franco Rojas López
    • 1
  • Héctor Jiménez-Salazar
    • 1
  • David Pinto
    • 1
    • 2
  1. 1.Faculty of Computer Science, BUAP, Puebla, 72570 Ciudad UniversitariaMexico
  2. 2.Department of Information Systems and Computation, UPV, Valencia 46022, Camino de Vera s/nSpain

Personalised recommendations