An Approach to Clustering Abstracts

Alexandrov, Mikhail; Gelbukh, Alexander; Rosso, Paolo

doi:10.1007/11428817_25

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3513)

Abstract

Free access to full-text scientific papers in major digital libraries and other web repositories is limited to their abstracts, which consist of no more than several dozen words. Current keyword-based techniques can cluster such short texts only when the data set is multi-category, e.g., some documents are devoted to sport, others to medicine, others to politics, etc. However, they fail on narrow domain-oriented libraries, e.g., those containing documents only on physics, or only on geology, or only on computational linguistics. Nevertheless, just such data sets are the most frequent and the most interesting ones. We propose a simple procedure for clustering abstracts, which consists of grouping keywords and using a more adequate document similarity measure. We use Stein’s MajorClust method for clustering both keywords and documents. We illustrate our approach on texts from the proceedings of a narrow-topic conference. Limitations of our approach are also discussed. Our preliminary experiments show that abstracts cannot be clustered with the same quality as full texts, though the achieved quality is adequate for many applications; accordingly, we support Makagonov’s proposal that digital libraries should provide document images of the full texts of papers (and not only abstracts) for open access via the Internet, in order to help in search, classification, clustering, selection, and proper referencing of the papers.

Work done under partial support of the Government of Valencia, Mexican Government (CONACyT, SNI, CGPI, COFAA-IPN), R2D2 CICYT (TIC2003-07158-C04-03), and ICT EU-India (ALA/95/23/2003/077-054).
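The abstract names MajorClust as the clustering method but gives no algorithmic detail. As a rough illustration only, the sketch below follows the commonly described MajorClust idea: every node starts in its own cluster and then repeatedly adopts the label of the neighboring cluster with the largest total edge weight. The function names `build_graph` and `majorclust`, the tie-breaking rule, the iteration cap, and the toy similarity values are all assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def build_graph(edges):
    """Turn (node_a, node_b, similarity) triples into a symmetric adjacency dict."""
    graph = defaultdict(dict)
    for a, b, w in edges:
        graph[a][b] = w
        graph[b][a] = w
    return dict(graph)


def majorclust(graph, max_iter=100):
    """MajorClust-style label propagation on a weighted undirected graph.

    graph: dict mapping node -> {neighbor: similarity}.
    Returns a dict mapping node -> cluster label.
    """
    # Every node starts as its own cluster.
    labels = {node: node for node in graph}
    for _ in range(max_iter):  # hypothetical safety cap on the number of passes
        changed = False
        for node in sorted(graph):
            # Total edge weight from this node toward each neighboring cluster.
            attraction = defaultdict(float)
            for neighbor, w in graph[node].items():
                attraction[labels[neighbor]] += w
            if not attraction:
                continue  # an isolated node keeps its own label
            best = max(attraction.values())
            candidates = sorted(l for l, a in attraction.items() if a == best)
            # Assumed tie-break: keep the current label if it is among the
            # best candidates, otherwise take the smallest candidate label.
            new_label = labels[node] if labels[node] in candidates else candidates[0]
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Toy data: two tight groups joined by one weak link (values are made up).
toy_edges = [
    ("a", "b", 0.9), ("a", "c", 0.9), ("b", "c", 0.9),
    ("d", "e", 0.9), ("d", "f", 0.9), ("e", "f", 0.9),
    ("c", "d", 0.1),
]
toy_labels = majorclust(build_graph(toy_edges))
print(toy_labels)
```

On this toy graph the nodes a, b, c end up sharing one label and d, e, f another. In the setting the abstract describes, the nodes would be keywords or abstracts and the weights would come from whatever similarity measure is chosen; that measure is not specified here.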

References

Alexandrov, M., Blanco, X., Makagonov, P.: Testing Word Similarity: Language Independent Approach with Examples from Romance. In: Meziane, F., Métais, E. (eds.) NLDB 2004. LNCS, vol. 3136, pp. 223–234. Springer, Heidelberg (2004)
Alexandrov, M., Gelbukh, A., Rosso, P.: Clustering Very Short Documents based on Grouping Keywords. In: Abstracts of the 30-th Latin-American Conf. on Informatics, Univ. Edition, Peru, p. 133 (2004)
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
Meyer zu Eissen, S., Stein, B.: Analysis of Clustering Algorithms for Web-based Search. In: Karagiannis, D., Reimer, U. (eds.) PAKM 2002. LNCS (LNAI), vol. 2569, pp. 168–178. Springer, Heidelberg (2002)
Gelbukh, A. (ed.): CICLing 2002. LNCS, vol. 2276. Springer, Heidelberg (2002), www.CICLing.org
Hardy, A., Andre, P.: An investigation of nine procedures for detecting the structure in a data set. In: Advances in data science and classification. Studies in Classification, Data Analysis and Knowledge Organization, pp. 29–36. Springer, Heidelberg (1998)
Hartigan, J.: Clustering Algorithms. Wiley, Chichester (1975)
Hynek, J., Jezek, K., Rohlik, O.: Short Document Categorization – Itemsets Method. In: PKDD-2000. LNCS, p. 6. Springer, Heidelberg (2000)
Kang, B.-Y., Kim, H.-J., Lee, S.-J.: Performance Analysis of Semantic Indexing in Text Retrieval. In: Gelbukh, A. (ed.) CICLing 2004. LNCS, vol. 2945, pp. 433–436. Springer, Heidelberg (2004)
Makagonov, P., Alexandrov, M., Sboychakov, K.: Keyword-based technology for clustering short documents. In: Selected Papers. Computing Research, CIC-IPN, Mexico, pp. 105–114 (2000)
Makagonov, P., Alexandrov, M., Sboychakov, K.: A toolkit for development of the domain-oriented dictionaries for structuring document flows. In: Data Analysis, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 83–88. Springer, Heidelberg (2000)
Makagonov, P., Alexandrov, M., Gelbukh, A.: Clustering Abstracts instead of Full Texts. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2004. LNCS (LNAI), vol. 3206, pp. 129–135. Springer, Heidelberg (2004)
Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
Porter, M.: An algorithm for suffix stripping. Program 14, 130–137 (1980)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5), 513–523 (1988)
Solomon, G.: Data dependent methods of cluster analysis. In: Classification and Clustering, pp. 129–147. Academic Press, London (1977) (Russian version)
Stein, B., Niggemann, O.: On the Nature of Structure and Its Identification. In: Widmayer, P., Neyer, G., Eidenbenz, S. (eds.) WG 1999. LNCS, vol. 1665, pp. 122–134. Springer, Heidelberg (1999)
Stein, B., Meyer zu Eissen, S.: Document Categorization with MajorClust. In: Proc. 12th Workshop on Information Technology and Systems, Tech. Univ. of Barcelona, Spain, p. 6 (2002)
Stein, B., Meyer zu Eissen, S.: Automatic document categorization. In: Günter, A., Kruse, R., Neumann, B. (eds.) KI 2003. LNCS (LNAI), vol. 2821, pp. 254–266. Springer, Heidelberg (2003)
Stein, B., Meyer zu Eissen, S., Wissbrock, F.: On Cluster Validity and the Information Need of Users. In: Proc. 3-rd IASTED Intern. Conf. on Artificial Intelligence and Applications (AIA 2003), pp. 216–221. Acta Press (2003)
Strzalkowski, T. (ed.): Natural Language and Information Retrieval. Kluwer Academic Publishers, Dordrecht (1999)
Zizka, J., Bourek, A.: Automated Selection of Interesting Medical Text Documents by the TEA Text Analyzer. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 402–404. Springer, Heidelberg (2002)

Author information

Authors and Affiliations

Center for Computing Research, National Polytechnic Institute, Mexico
Mikhail Alexandrov & Alexander Gelbukh
Polytechnic University of Valencia, Spain
Mikhail Alexandrov & Paolo Rosso

Editor information

Editors and Affiliations

Department of Software and Computing Systems, University of Alicante, Spain
Andrés Montoyo
Grupo de investigación del Procesamiento del Lenguaje y Sistemas de Información, Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Alicante, Spain
Rafael Muñoz
Lab. CEDRIC, CNAM, Paris, France
Elisabeth Métais

About this paper

Cite this paper

Alexandrov, M., Gelbukh, A., Rosso, P. (2005). An Approach to Clustering Abstracts. In: Montoyo, A., Muñoz, R., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428817_25

DOI: https://doi.org/10.1007/11428817_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26031-8
Online ISBN: 978-3-540-32110-1
eBook Packages: Computer Science, Computer Science (R0)