Unsupervised Topic-Oriented Keyphrase Extraction and Its Application to Croatian

  • Josip Saratlija
  • Jan Šnajder
  • Bojana Dalbelo Bašić
Conference paper

DOI: 10.1007/978-3-642-23538-2_43

Part of the Lecture Notes in Computer Science book series (LNCS, volume 6836)
Cite this paper as:
Saratlija J., Šnajder J., Dalbelo Bašić B. (2011) Unsupervised Topic-Oriented Keyphrase Extraction and Its Application to Croatian. In: Habernal I., Matoušek V. (eds) Text, Speech and Dialogue. TSD 2011. Lecture Notes in Computer Science, vol 6836. Springer, Berlin, Heidelberg

Abstract

Labeling documents with keyphrases is a tedious and expensive task. Most approaches to automatic keyphrase extraction rely on supervised learning and require manually labeled training data. In this paper we propose a fully unsupervised keyphrase extraction method that differs from the usual generic keyphrase extractors in the way the keyphrases are formed. Our method begins by building topically related word clusters from which document keywords are selected, and then expands the selected keywords into syntactically valid keyphrases. We evaluate our approach on a Croatian document collection annotated by eight human experts, taking into account the high subjectivity of the keyphrase extraction task. The proposed method reaches up to F1 = 44.5%, which is below the performance of human annotators but comparable to that of a supervised approach.
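
For readers unfamiliar with the F1 measure reported above, the following is a minimal sketch of how precision, recall, and F1 are commonly computed between a set of extracted keyphrases and a set of reference keyphrases. It assumes exact matching after case and whitespace normalization; the abstract does not specify the matching criterion or how the eight annotators' labels are combined, so the paper's actual evaluation protocol may differ. The function name `keyphrase_f1` and the example phrases are hypothetical.

```python
def keyphrase_f1(extracted, gold):
    """Exact-match precision, recall, and F1 between two keyphrase collections."""
    # Normalize case and whitespace so trivially different spellings match.
    # (Assumption: exact match after normalization; not stated in the abstract.)
    ext = {" ".join(p.lower().split()) for p in extracted}
    ref = {" ".join(p.lower().split()) for p in gold}
    if not ext or not ref:
        return 0.0, 0.0, 0.0
    hits = len(ext & ref)           # phrases present in both sets
    precision = hits / len(ext)     # share of extracted phrases that are correct
    recall = hits / len(ref)        # share of reference phrases that were found
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example data, not taken from the paper.
extracted = ["keyphrase extraction", "k-means", "Croatian language", "word clusters"]
gold = ["keyphrase extraction", "unsupervised learning", "k-means"]
p, r, f = keyphrase_f1(extracted, gold)
print(f"P = {p:.3f}, R = {r:.3f}, F1 = {f:.3f}")
```

With several annotators, one plausible option is to compute this score against each annotator separately and average the results, but that choice is an assumption here rather than something the abstract states.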

Keywords

Information extraction, keyphrase extraction, unsupervised learning, k-means, Croatian language

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Josip Saratlija (1)
  • Jan Šnajder (1)
  • Bojana Dalbelo Bašić (1)
  1. Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia
