Clustering Abstracts of Scientific Texts Using the Transition Point Technique

  • David Pinto
  • Héctor Jiménez-Salazar
  • Paolo Rosso
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3878)

Abstract

Free access to scientific papers in major digital libraries and other web repositories is limited to their abstracts. Current keyword-based techniques fail on narrow, domain-oriented libraries, e.g., those containing only documents on high energy physics, such as the hep-ex collection of CERN. We propose a simple procedure for clustering abstracts, which consists of applying the transition point technique during the term selection process. This technique uses mid-frequency terms to index the documents, because these terms have high semantic content. In our experiments, the transition point approach was compared with well-known unsupervised term selection techniques. The transition point technique showed that it is possible to obtain better performance than with traditional methods. Moreover, we propose an approach for analysing the stability of the transition point term selection method.
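The abstract does not spell out how the transition point is computed. As a minimal sketch, the snippet below assumes the formula commonly given in the transition point literature, TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms that occur exactly once, and then keeps terms whose frequency lies near TP. The function names and the `width` parameter are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Estimate the transition point from a term -> frequency mapping.

    Assumed formula (not stated in the abstract):
    TP = (sqrt(8 * I1 + 1) - 1) / 2, with I1 = number of terms of frequency 1.
    """
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_mid_frequency_terms(tokens, width=0.4):
    """Keep terms whose frequency falls in a band around the transition point.

    `width` is a hypothetical neighbourhood parameter: a term is kept when
    its frequency lies in [TP * (1 - width), TP * (1 + width)].
    """
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - width), tp * (1 + width)
    return {term for term, f in freqs.items() if low <= f <= high}


# Toy vocabulary: three terms occur once, so I1 = 3 and TP = 2.
tokens = ["a"] * 5 + ["b"] * 2 + ["c"] * 2 + ["d", "e", "f"]
selected = select_mid_frequency_terms(tokens)
```

In this toy example the very frequent term and the terms that occur only once are discarded, and only the terms with frequency close to the transition point remain as index terms for clustering.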

Keywords

Transition Point; Digital Library; Natural Language Processing; Short Text; Vocabulary Size
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • David Pinto (1, 2)
  • Héctor Jiménez-Salazar (1)
  • Paolo Rosso (2)
  1. Faculty of Computer Science, BUAP, Mexico
  2. Department of Information Systems and Computation, UPV, Valencia, Spain
