A New Domain Independent Keyphrase Extraction System

  • Nirmala Pudota
  • Antonina Dattolo
  • Andrea Baruzzo
  • Carlo Tasso
Part of the Communications in Computer and Information Science book series (CCIS, volume 91)


In this paper we present a keyphrase extraction system that extracts potential phrases from a single document in an unsupervised, domain-independent way. We extract word n-grams from the input document and use linguistic knowledge (i.e., part-of-speech tags) and statistical information (i.e., frequency, position, lifespan) of each n-gram to define candidate phrases and their respective feature sets. The proposed approach can be applied to any document; however, to assess the effectiveness of the system for digital libraries, we evaluated it on a set of scientific documents and compared our results with those of current keyphrase extraction systems.
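As a rough illustration of the statistical side of the approach described in the abstract, the sketch below enumerates word n-grams from a single document and attaches frequency, position, and lifespan values to each one. The function names (`tokenize`, `ngram_features`), the `max_n` parameter, and the exact feature formulas are assumptions made for this example; the abstract does not define them, and the part-of-speech filtering step is omitted because it would need an external tagger.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngram_features(text, max_n=3):
    """Return {ngram: features} for all word n-grams with 1 <= n <= max_n.

    The feature definitions below are illustrative assumptions:
      frequency - number of occurrences of the n-gram
      first_pos - index of the first occurrence divided by the token count
      lifespan  - distance between the first and last occurrence divided
                  by the token count
    """
    tokens = tokenize(text)
    total = len(tokens)
    starts = defaultdict(list)  # n-gram -> start indices, in document order
    for n in range(1, max_n + 1):
        for i in range(total - n + 1):
            starts[" ".join(tokens[i:i + n])].append(i)
    return {
        gram: {
            "frequency": len(idx),
            "first_pos": idx[0] / total,
            "lifespan": (idx[-1] - idx[0]) / total,
        }
        for gram, idx in starts.items()
    }


if __name__ == "__main__":
    sample = "digital library systems index a digital library"
    features = ngram_features(sample)
    print(features["digital library"])
```

In a fuller pipeline, candidates would presumably first be filtered by part-of-speech patterns and then ranked by combining these features; how the features are weighted is not stated in the abstract and is left open here.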


Digital Library; Noun Phrase; Query Expansion; Single Document; Linguistic Knowledge
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Nirmala Pudota (1)
  • Antonina Dattolo (1)
  • Andrea Baruzzo (1)
  • Carlo Tasso (1)
  1. Artificial Intelligence Lab, Department of Mathematics and Computer Science, University of Udine, Italy
