Unsupervised Topic-Oriented Keyphrase Extraction and Its Application to Croatian

  • Josip Saratlija
  • Jan Šnajder
  • Bojana Dalbelo Bašić
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6836)


Labeling documents with keyphrases is a tedious and expensive task. Most approaches to automatic keyphrase extraction rely on supervised learning and require manually labeled training data. In this paper we propose a fully unsupervised keyphrase extraction method that differs from the usual generic keyphrase extractors in the way the keyphrases are formed. Our method first builds topically related word clusters from which document keywords are selected, and then expands the selected keywords into syntactically valid keyphrases. We evaluate our approach on a Croatian document collection annotated by eight human experts, taking into account the high subjectivity of the keyphrase extraction task. The proposed method reaches up to F1 = 44.5%, which is below the performance of human annotators but comparable to that of a supervised approach.
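The three stages named in the abstract (cluster words, select keywords, expand keywords into keyphrases) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not give the details, so the following are assumptions made for illustration only: words are represented by their per-sentence occurrence counts, k-means is seeded with a deterministic farthest-first rule, the keyword of a cluster is its most frequent word, and "syntactically valid" is approximated by stopword-free n-grams in place of real morphosyntactic patterns. All function names and the toy stopword list are hypothetical.

```python
import math
import re
from collections import Counter

# Toy stopword list (an assumption; a real system would use a Croatian list).
STOPWORDS = {"the", "a", "of", "and", "in", "is", "to", "for", "are", "by", "on"}

def sentences(text):
    """Split text into lowercase token lists, one per sentence."""
    return [re.findall(r"\w+", s.lower()) for s in re.split(r"[.!?]+", text) if s.strip()]

def word_vectors(sents):
    """Represent each content word by its per-sentence occurrence counts (assumed features)."""
    vocab = sorted({w for s in sents for w in s if w not in STOPWORDS})
    return {w: [float(s.count(w)) for s in sents] for w in vocab}

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(vectors, k, iters=20):
    """Lloyd's k-means with deterministic farthest-first seeding (an assumed seeding rule)."""
    words = list(vectors)
    centers = [vectors[words[0]]]
    while len(centers) < k:
        far = max(words, key=lambda w: min(dist(vectors[w], c) for c in centers))
        centers.append(vectors[far])
    assign = {}
    for _ in range(iters):
        # Assignment step: each word goes to its nearest center.
        assign = {w: min(range(k), key=lambda i: dist(vectors[w], centers[i])) for w in words}
        # Update step: each center becomes the mean of its members.
        for i in range(k):
            members = [vectors[w] for w in words if assign[w] == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def select_keywords(assign, freq):
    """Pick the most frequent word of each non-empty cluster (ties broken alphabetically)."""
    best = {}
    for w, c in assign.items():
        if c not in best or (freq[w], w) > (freq[best[c]], best[c]):
            best[c] = w
    return sorted(best.values())

def expand(keyword, sents, max_len=3):
    """Expand a keyword into the most frequent stopword-free n-gram containing it."""
    cands = Counter()
    for s in sents:
        for i in range(len(s)):
            for n in range(1, max_len + 1):
                gram = s[i:i + n]
                if len(gram) == n and keyword in gram and not STOPWORDS & set(gram):
                    cands[" ".join(gram)] += 1
    # Prefer frequent phrases; on equal frequency prefer the longer one.
    return max(cands, key=lambda p: (cands[p], len(p.split()), p))

def extract_keyphrases(text, k=3):
    """Cluster words, select one keyword per cluster, expand each into a keyphrase."""
    sents = sentences(text)
    vectors = word_vectors(sents)
    if not vectors:
        return []
    freq = Counter(w for s in sents for w in s)
    assign = kmeans(vectors, min(k, len(vectors)))
    return [expand(kw, sents) for kw in select_keywords(assign, freq)]
```

Calling `extract_keyphrases(text, k=3)` on a document string returns at most `k` phrases, one per non-empty cluster. Seeding is deterministic here only so that repeated runs give the same result; a production system would more likely use a randomized scheme such as k-means++.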


Keywords: Information extraction, keyphrase extraction, unsupervised learning, k-means, Croatian language





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Josip Saratlija (1)
  • Jan Šnajder (1)
  • Bojana Dalbelo Bašić (1)
  1. Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia
