Automatic Extraction and Learning of Keyphrases from Scientific Articles

  • Yaakov HaCohen-Kerner
  • Zuriel Gross
  • Asaf Masa
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3406)


Many academic journals and conferences require that each article include a list of keyphrases. These keyphrases should provide general information about the contents and topics of the article. Keyphrases can save valuable time in tasks such as filtering, summarization, and categorization. In this paper, we investigate automatic extraction and learning of keyphrases from scientific articles written in English. First, we introduce various baseline extraction methods. Some of them, formalized by us, are very successful for academic papers. We then integrate these methods using several machine learning methods. The best results were achieved by J48, an improved variant of C4.5. These results are significantly better than those achieved by previous extraction systems, which are regarded as the state of the art.
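To make the idea of a baseline extraction method concrete, here is a minimal sketch of a term-frequency baseline in Python. It is not the authors' method: the abstract does not specify the baselines, so the stopword list, the tokenization, the maximum phrase length, and the function names `candidate_phrases` and `top_keyphrases` are all assumptions made for illustration. The machine learning step (J48) that combines several baselines is not covered here.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {
    "a", "an", "and", "are", "as", "by", "for", "from", "in", "is",
    "of", "on", "or", "that", "the", "this", "to", "we", "with",
}


def candidate_phrases(text, max_len=3):
    """Yield every run of 1..max_len consecutive non-stopword tokens."""
    # Split at punctuation so that phrases never cross a sentence or clause boundary.
    for segment in re.split(r"[.,;:!?()\n]+", text.lower()):
        tokens = re.findall(r"[a-z][a-z0-9\-]*", segment)
        for i in range(len(tokens)):
            for n in range(1, max_len + 1):
                gram = tokens[i:i + n]
                # Stop when the segment runs out or the phrase would contain a stopword;
                # any longer phrase starting at i would fail the same check.
                if len(gram) < n or any(t in STOPWORDS for t in gram):
                    break
                yield " ".join(gram)


def top_keyphrases(text, k=5, max_len=3):
    """Rank candidate phrases by raw frequency and return the k best."""
    counts = Counter(candidate_phrases(text, max_len))
    # Ties are broken by preferring longer phrases, then alphabetical order.
    ranked = sorted(
        counts.items(),
        key=lambda kv: (-kv[1], -len(kv[0].split()), kv[0]),
    )
    return [phrase for phrase, _ in ranked[:k]]


if __name__ == "__main__":
    sample = (
        "Keyphrase extraction is useful. "
        "Keyphrase extraction helps summarization. "
        "Summarization saves time."
    )
    print(top_keyphrases(sample, k=3))
```

A learning-based system in the spirit of the abstract would presumably treat scores like this one as features, alongside other baseline scores, and train a classifier such as a decision tree to decide which candidates are keyphrases.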


Machine Learning Method; Scientific Article; Term Frequency; Baseline Method; Automatic Extraction
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Yaakov HaCohen-Kerner (1)
  • Zuriel Gross (1)
  • Asaf Masa (1)
  1. Department of Computer Sciences, Jerusalem College of Technology (Machon Lev), Jerusalem, Israel
