An Approach to Clustering Abstracts

  • Mikhail Alexandrov
  • Alexander Gelbukh
  • Paolo Rosso
Conference paper

DOI: 10.1007/11428817_25

Part of the Lecture Notes in Computer Science book series (LNCS, volume 3513)
Cite this paper as:
Alexandrov M., Gelbukh A., Rosso P. (2005) An Approach to Clustering Abstracts. In: Montoyo A., Muñoz R., Métais E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg

Abstract

Free access to full-text scientific papers in major digital libraries and other web repositories is limited to their abstracts, which consist of no more than several dozen words. Current keyword-based techniques can cluster such short texts only when the data set is multi-category, e.g., some documents are devoted to sport, others to medicine, others to politics, etc. However, they fail on narrow domain-oriented libraries, e.g., those containing documents only on physics, or only on geology, or only on computational linguistics. Nevertheless, precisely such data sets are the most frequent and the most interesting ones. We propose a simple procedure to cluster abstracts, which consists in grouping keywords and using a more adequate document similarity measure. We use Stein’s MajorClust method for clustering both keywords and documents. We illustrate our approach on texts from the proceedings of a narrow-topic conference. Limitations of our approach are also discussed. Our preliminary experiments show that abstracts cannot be clustered with the same quality as full texts, though the achieved quality is adequate for many applications; accordingly, we support Makagonov’s proposal that digital libraries should provide document images of the full texts of papers (and not only abstracts) for open access via the Internet, in order to help in search, classification, clustering, selection, and proper referencing of the papers.
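
The abstract names MajorClust but does not describe it. As a rough illustration only, the sketch below follows the commonly described form of the method: every node starts in its own cluster and repeatedly adopts the cluster label that has the largest total edge weight among its neighbours, until no label changes. The function name `majorclust`, the dictionary-based graph, the tie-breaking rule, and the toy similarity values are all assumptions made for this example; this is not the authors' implementation, and it does not reproduce their keyword grouping or similarity measure.

```python
from collections import defaultdict


def majorclust(weights, max_iter=100):
    """Label-propagation sketch in the spirit of MajorClust.

    weights: dict mapping node -> {neighbour: similarity}, assumed symmetric.
    Returns a dict mapping node -> integer cluster label.
    """
    # Every node starts as its own cluster.
    labels = {node: i for i, node in enumerate(weights)}
    for _ in range(max_iter):
        changed = False
        for node in weights:
            # Total edge weight pulling this node towards each neighbouring cluster.
            attraction = defaultdict(float)
            for neighbour, w in weights[node].items():
                attraction[labels[neighbour]] += w
            if not attraction:
                continue  # isolated node keeps its own label
            best = max(attraction.values())
            # Assumption: keep the current label on ties, else take the smallest best label.
            if attraction.get(labels[node], 0.0) < best:
                labels[node] = min(l for l, a in attraction.items() if a == best)
                changed = True
        if not changed:
            break
    return labels


# Hypothetical similarities: two tight groups joined by one weak edge (c-d).
toy_graph = {
    "a": {"b": 0.9, "c": 0.9},
    "b": {"a": 0.9, "c": 0.9},
    "c": {"a": 0.9, "b": 0.9, "d": 0.1},
    "d": {"c": 0.1, "e": 0.9, "f": 0.9},
    "e": {"d": 0.9, "f": 0.9},
    "f": {"d": 0.9, "e": 0.9},
}
toy_labels = majorclust(toy_graph)
```

On this toy graph the nodes a, b, c end up with one label and d, e, f with another. The same routine could in principle be applied first to a keyword graph and then to a document graph, as the abstract suggests, given suitable similarity weights for each.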

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Mikhail Alexandrov (1, 2)
  • Alexander Gelbukh (1)
  • Paolo Rosso (2)
  1. Center for Computing Research, National Polytechnic Institute, Mexico
  2. Polytechnic University of Valencia, Spain
