Learning Algorithms for Keyphrase Extraction

Turney, Peter D.

doi:10.1023/A:1009976227802

Published: May 2000

Volume 2, pages 303–336, (2000)

Information Retrieval


Abstract

Many academic journals ask their authors to provide a list of about five to fifteen keywords, to appear on the first page of each article. Since these keywords are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting keyphrases from text. The experimental results support the claim that a custom-designed algorithm (GenEx), incorporating specialized procedural domain knowledge, can generate better keyphrases than a general-purpose algorithm (C4.5). Subjective human evaluation of the keyphrases generated by GenEx suggests that about 80% of the keyphrases are acceptable to human readers. This level of performance should be satisfactory for a wide variety of applications.
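
The supervised framing described in the abstract (a document as a set of phrases, each labelled as a positive or negative example of a keyphrase) can be illustrated with a minimal sketch. It is not the C4.5 or GenEx procedure from the paper. The stop-word list, the three features (frequency, relative position of first occurrence, number of words), the three-word phrase limit, and the function names `candidate_phrases` and `labelled_examples` are all assumptions made for illustration. Punctuation is ignored, so some candidates span sentence boundaries.

```python
import re

# Hypothetical stop-word list; a real system would use a much longer one.
STOP_WORDS = {"a", "an", "and", "as", "for", "in", "is", "of", "on", "the", "to", "we"}

def candidate_phrases(text, max_words=3):
    """Return {phrase: (frequency, relative position of first occurrence)}."""
    words = re.findall(r"[a-z]+", text.lower())
    stats = {}
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Skip phrases that begin or end with a stop word.
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            phrase = " ".join(gram)
            # Keep the position recorded at the first occurrence.
            freq, first = stats.get(phrase, (0, i / len(words)))
            stats[phrase] = (freq + 1, first)
    return stats

def labelled_examples(text, author_keyphrases):
    """Turn one document into (phrase, features, label) training examples."""
    gold = {k.lower() for k in author_keyphrases}
    examples = []
    for phrase, (freq, first) in candidate_phrases(text).items():
        features = {
            "frequency": freq,
            "first_position": first,
            "num_words": len(phrase.split()),
        }
        # Positive example if the phrase matches an author-supplied keyphrase.
        examples.append((phrase, features, phrase in gold))
    return examples

doc = ("Keyphrase extraction is a supervised learning task. "
       "We apply decision tree induction to keyphrase extraction.")
examples = labelled_examples(doc, ["keyphrase extraction", "supervised learning"])
positives = sorted(phrase for phrase, _, label in examples if label)
print(positives)
```

The resulting list of labelled feature vectors is the kind of input a general-purpose classifier such as a decision tree learner could be trained on; most candidates are negative examples, so the classes are heavily imbalanced.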

Author information

Authors and Affiliations

National Research Council of Canada, Institute for Information Technology, Ottawa, Ontario, Canada, K1A 0R6
Peter D. Turney


About this article

Cite this article

Turney, P.D. Learning Algorithms for Keyphrase Extraction. Information Retrieval 2, 303–336 (2000). https://doi.org/10.1023/A:1009976227802

Issue Date: May 2000
DOI: https://doi.org/10.1023/A:1009976227802
