Abstract
We present a keyphrase extraction algorithm for scientific publications. Different from previous work, we introduce features that capture the positions of phrases in document with respect to logical sections found in scientific discourse. We also introduce features that capture salient morphological phenomena found in scientific keyphrases, such as whether a candidate keyphrase is an acronyms or uses specific terminologically productive suffixes. We have implemented these features on top of a baseline feature set used by Kea [1]. In our evaluation using a corpus of 120 scientific publications multiply annotated for keyphrases, our system significantly outperformed Kea at the pā<ā.05 level. As we know of no other existing multiply annotated keyphrase document collections, we have also made our evaluation corpus publicly available. We hope that this contribution will spur future comparative research.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Frank, E., Paynter, G.W., Witten, H.I., Gutwin, C., Nevill-Manning, C.G.: Domain specific keyphrase extraction. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence, pp. 668ā673 (1999)
Kim, W., Wilbur, W.J.: Corpus-based statistical screening for content-bearing terms. J. Am. Soc. Inf. Sci. Technol.Ā 52, 247ā259 (2001)
Tomokiyo, T., Hurst, M.: A language model approach to keyphrase extraction. In: Proceedings of ACL Workshop on Multiword Expressions (2003)
Barker, K., Cornacchia, N.: Using noun phrase heads to extract document keyphrases. In: Proc. of the 13th Biennial Conf. of the Canadian Society on Computational Studies of Intelligence, pp. 40ā52. Springer, Heidelberg (2000)
Turney, P.D.: Learning to extract keyphrases from text. Technical Report ERB-1057, National Research Council, Institute for Information Technology (1999)
Turney, P.D.: Coherent keyphrase extraction via web mining. In: IJCAI 2003. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pp. 434ā439 (2003)
Steyvers, M., Griffiths, T.: Probabilistic topic models. In: Landauer, T., Mcnamara, D., Dennis, S., Kintsch, W. (eds.) Latent Semantic Analysis: A Road to Meaning, Laurence Erlbaum, Mahwah (2005)
Dumais, S.T., Platt, J., Hecherman, D., Sahami, M.: Inductive learning algorithms and representations for text categorization. In: CIKM. Proc. of 7th International Conference on Information and Knowledge Management, pp. 148ā155 (1998)
Pouliquen, B., Steinberger, R., Ignat, C.: Automatic annotation of multilingual text collections with a conceptual thesaurus. In: BUG (2003)
Medelyan, O., Witten, I.H.: Thesaurus based automatic keyphrase indexing. In: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, pp. 296ā297. ACM Press, New York (2006)
Ratnaparkhi, A.: A maximum entropy part of speech tagger. In: Proc. ACL-SIGDAT Conference on Empirical Methods in Natural Language Processing, Philadelphia (1996)
Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
Nguyen, T.D.: Automatic keyphrase generation. Technical report, National University of Singapore (2007)
Jones, S., Paynter, G.W.: Human evaluation of Kea, an automatic keyphrasing system. In: ACM/IEEE Joint Conference on Digital Libraries, pp. 148ā156 (2001)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
Ā© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nguyen, T.D., Kan, MY. (2007). Keyphrase Extraction in Scientific Publications. In: Goh, D.HL., Cao, T.H., SĆølvberg, I.T., Rasmussen, E. (eds) Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers. ICADL 2007. Lecture Notes in Computer Science, vol 4822. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77094-7_41
Download citation
DOI: https://doi.org/10.1007/978-3-540-77094-7_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77093-0
Online ISBN: 978-3-540-77094-7
eBook Packages: Computer ScienceComputer Science (R0)