Clustering Abstracts Instead of Full Texts

  • Conference paper
Text, Speech and Dialogue (TSD 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3206)

Abstract

The accessibility of digital libraries and other web-based repositories has created the illusion that the full texts of scientific papers are accessible. However, in the majority of cases such access (at least free access) is limited to abstracts of no more than 50–100 words. The traditional keyword-based approach to clustering this type of document gives unstable and imprecise results. We show that these results can be easily improved with more adequate keyword selection and document similarity evaluation, and we suggest simple procedures for both. We evaluate our approach on data from two international conferences. One of our conclusions is a suggestion that digital libraries and other repositories provide document images of the full texts of papers along with their abstracts for open access via the Internet.
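
As a rough illustration of the keyword-based setting the abstract refers to, the following minimal sketch represents each abstract as a vector of keyword counts and compares abstracts by cosine similarity. It is not the procedure proposed in the paper: the keyword list, the sample texts, and the function names `keyword_vector` and `cosine` are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def keyword_vector(text, keywords):
    """Count how often each keyword occurs in the text (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[k] for k in keywords]


def cosine(u, v):
    """Cosine similarity of two count vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical keyword list and toy abstracts (assumptions, not data from the paper).
KEYWORDS = ["clustering", "keyword", "parser", "grammar"]
abstracts = [
    "Keyword selection for clustering short documents.",
    "Clustering abstracts with a keyword list.",
    "A grammar and a parser for spoken dialogue.",
]

vectors = [keyword_vector(a, KEYWORDS) for a in abstracts]
sim_01 = cosine(vectors[0], vectors[1])  # the two abstracts share both of their keywords
sim_02 = cosine(vectors[0], vectors[2])  # the two abstracts share no keywords
```

With texts of only 50–100 words, each vector has very few nonzero entries, so adding or dropping a single keyword can move a pairwise similarity from one extreme to the other. This is one plausible reading of why such results are described as unstable.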

Work done under partial support of the Mexican Government (CONACyT, SNI, CGPI, COFAA) and the Korean Government (KIPA professorship). The third author is currently on sabbatical leave at Chung-Ang University.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Makagonov, P., Alexandrov, M., Gelbukh, A. (2004). Clustering Abstracts Instead of Full Texts. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2004. Lecture Notes in Computer Science, vol 3206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30120-2_17

  • DOI: https://doi.org/10.1007/978-3-540-30120-2_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23049-6

  • Online ISBN: 978-3-540-30120-2

  • eBook Packages: Springer Book Archive
