Keyword Extraction from Short Documents Using Three Levels of Word Evaluation

Timonen, Mika; Toivanen, Timo; Kasari, Melissa; Teng, Yue; Cheng, Chao; He, Liang

doi:10.1007/978-3-642-54105-6_9

Part of the book series: Communications in Computer and Information Science (CCIS, volume 415)

Included in the following conference series:

International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management

Abstract

In this paper we propose a novel approach for keyword extraction from short documents where each document is assessed on three levels: corpus level, cluster level and document level. We focus on documents that contain fewer than 100 words. The main challenge comes from the defining characteristic of short documents: each word usually occurs only once within the document. Therefore, traditional approaches based on term frequency do not perform well with short documents. To tackle this challenge we propose a novel unsupervised keyword extraction approach called Informativeness-based Keyword Extraction (IKE). We compare the performance of the proposed approach against other keyword extraction methods, such as CollabRank, KeyGraph, Chi-squared, and TF-IDF. In the experimental evaluation IKE shows promising results by outperforming the competing methods.
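
To illustrate why term-frequency weighting loses its discriminating power on short documents, the following minimal Python sketch computes plain TF-IDF over three invented one-line documents. It is not the IKE method proposed in the paper; the function name tf_idf, the whitespace tokenizer, the log(N/df) form of IDF and the sample texts are all assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document using plain TF * IDF.

    Illustrative assumption: whitespace tokenization and IDF = log(N / df).
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # in short documents this is almost always 1
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores


# Hypothetical short documents (invented for this sketch).
docs = [
    "great battery life on this phone",
    "phone screen cracked after one week",
    "battery drains fast but screen is great",
]

scores = tf_idf(docs)
for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:.3f}")
```

Because every word occurs exactly once in each document, the term-frequency factor is 1 throughout and the score reduces to IDF alone. In the first document the function words "on" and "this" therefore receive the same score as "life", which is the kind of failure the abstract attributes to frequency-based methods.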

Author information

Authors and Affiliations

VTT Technical Research Centre of Finland, PO 1000, FI-02044, Espoo, Finland
Mika Timonen & Timo Toivanen
Department of Computer Science, University of Helsinki, PO 68, FI-00014, Finland
Melissa Kasari
Institute of Computer Applications, East China Normal University, No.500 Dongchuan Road, 200241, Shanghai, China
Yue Teng, Chao Cheng & Liang He

Editor information

Editors and Affiliations

IST - Technical University of Lisbon, Av.Rovisco Pais, 1, 1049-001, Lisbon, Portugal
Ana Fred
Delft University of Technology, Mekelweg 4, 2628, Delft, CD, The Netherlands
Jan L. G. Dietz
Informatics Research Centre, Henley Business School, University of Reading, RG6 6UD, UK
Kecheng Liu
INSTICC and IPS, Estefanilha, Setúbal, Portugal
Joaquim Filipe

About this paper

Cite this paper

Timonen, M., Toivanen, T., Kasari, M., Teng, Y., Cheng, C., He, L. (2013). Keyword Extraction from Short Documents Using Three Levels of Word Evaluation. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2012. Communications in Computer and Information Science, vol 415. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54105-6_9

DOI: https://doi.org/10.1007/978-3-642-54105-6_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54104-9
Online ISBN: 978-3-642-54105-6
eBook Packages: Computer Science (R0)
