Keyword Extraction from Short Documents Using Three Levels of Word Evaluation

  • Mika Timonen
  • Timo Toivanen
  • Melissa Kasari
  • Yue Teng
  • Chao Cheng
  • Liang He
Part of the Communications in Computer and Information Science book series (CCIS, volume 415)

Abstract

In this paper we propose a novel approach for keyword extraction from short documents where each document is assessed on three levels: corpus level, cluster level and document level. We focus our efforts on documents that contain fewer than 100 words. The main challenge comes from the defining characteristic of short documents: each word usually occurs only once within the document. Therefore, traditional approaches based on term frequency do not perform well with short documents. To tackle this challenge we propose a novel unsupervised keyword extraction approach called Informativeness-based Keyword Extraction (IKE). We compare the performance of the proposed approach against other keyword extraction methods, such as CollabRank, KeyGraph, Chi-squared, and TF-IDF. In the experimental evaluation IKE shows promising results by outperforming the competition.
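The term-frequency problem described above can be illustrated with a minimal sketch. This is not the IKE method, only a plain TF-IDF baseline written for illustration; the function name `tf_idf` and the three toy documents are assumptions and do not come from the paper. When every word occurs once in a document, the term-frequency factor is 1 for all words, so the score reduces to the corpus-level IDF term and carries no document-level signal.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf * idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

# Hypothetical short documents; every word occurs once per document.
docs = [
    "great battery life on this phone".split(),
    "battery died after one week".split(),
    "screen cracked after one drop".split(),
]
scores = tf_idf(docs)

# Within a document, the scores take only as many distinct values as
# there are distinct document frequencies, because tf is always 1.
distinct_scores_doc0 = {round(s, 12) for s in scores[0].values()}
```

In the first toy document, five of the six words receive the same score, so term frequency alone cannot rank them against each other.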

Keywords

Keyword Extraction, Machine Learning, Short Documents, Term Weighting, Text Mining

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Mika Timonen (1)
  • Timo Toivanen (1)
  • Melissa Kasari (2)
  • Yue Teng (3)
  • Chao Cheng (3)
  • Liang He (3)

  1. VTT Technical Research Centre of Finland, Espoo, Finland
  2. Department of Computer Science, University of Helsinki, Finland
  3. Institute of Computer Applications, East China Normal University, Shanghai, China
