DegExt: a language-independent keyphrase extractor

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

In this paper, we introduce DegExt, a graph-based, language-independent keyphrase extractor that extends the keyword extraction method described in Litvak and Last (Graph-based keyword extraction for single-document summarization. In: Proceedings of the workshop on multi-source multilingual information extraction and summarization, pp 17–24, 2008). We compare DegExt with two state-of-the-art approaches to keyphrase extraction: GenEx (Turney in Inf Retr 2:303–336, 2000) and TextRank (Mihalcea and Tarau in TextRank: bringing order into texts. In: Proceedings of the conference on empirical methods in natural language processing. Barcelona, Spain, 2004). We evaluated DegExt on collections of benchmark summaries in two languages: English and Hebrew. Our experiments on the English corpus show that, when the extracted phrases are matched against the gold standard collection, DegExt significantly outperforms TextRank and GenEx in terms of precision and area under curve for summaries of 15 keyphrases or more, at the expense of a mostly non-significant decrease in recall and F-measure. Because DegExt tends to extract longer phrases than GenEx and TextRank, it outperforms both in terms of recall and F-measure when the extracted phrases are instead compared word by word. On the Hebrew corpus, DegExt performs the same as TextRank regardless of the number of keyphrases. An additional experiment shows that DegExt applied to the TextRank representation graphs outperforms the other systems in the text classification task. For documents in both languages, DegExt is simpler to implement and has lower computational complexity than both GenEx and TextRank.
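The difference between phrase-level and word-level matching can be illustrated with a minimal Python sketch. It assumes exact string matching against the gold standard and whitespace tokenization, neither of which is confirmed by this abstract; the function names and the toy phrases are hypothetical. Per-document scores are macro-averaged, as the notes state.

```python
def prf(extracted, gold):
    """Precision, recall and F-measure of one document (exact matching assumed)."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def to_words(phrases):
    """Split phrases into the set of their single words (whitespace tokenization assumed)."""
    return {word for phrase in phrases for word in phrase.split()}


def macro_average(per_document_scores):
    """Average each of the three measures over documents, one vote per document."""
    n = len(per_document_scores)
    return tuple(sum(scores[i] for scores in per_document_scores) / n for i in range(3))


# Hypothetical toy data, not taken from the paper's corpora.
extracted = ["keyphrase extraction", "graph"]
gold = ["keyphrase extraction method", "graph"]

phrase_level = prf(extracted, gold)
word_level = prf(to_words(extracted), to_words(gold))
```

In this toy case only one of the two extracted phrases matches exactly, yet every extracted word appears in the gold standard, so word-level scores come out higher than phrase-level ones.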

Notes

  1. Best results shown for N = 2.

  2. Based on this fact, every application of PageRank to undirected unweighted graphs may be replaced by a much lighter approach based on degree centrality ranking.

  3. This part may be skipped for multilingual processing unless appropriate stemmers and stopword lists are provided for different languages.

  4. Degree centrality assigns a node an importance proportional to its degree, that is, the number of edges incident to it. In a directed graph we can distinguish between the in-degree (the number of incoming arcs) and the out-degree (the number of outgoing arcs). In this case, the degree of a node is the sum of its in-degree and its out-degree.

  5. Since we limit phrases to at most three words, a keyphrase can never be longer than three words.

  6. http://www.haaretz.co.il.

  7. Dataset is available at http://www.cs.bgu.ac.il/~litvakm/research/.

  8. http://nlp.stanford.edu/software/tagger.shtml.

  9. http://www.cs.bgu.ac.il/~adlerm/freespace/tagger.zip.

  10. No standard stopword list for Hebrew exists.

  11. We define the size of a graph as the number of its vertices.

  12. Since GenEx is a commercial, language-specific tool (we have the English version), we were unable to adapt its API to Hebrew, whereas the DegExt and TextRank implementations were adapted to both languages.

  13. We used macro-averaging in our calculations.

  14. In our corpus it appears under the name “doc1.txt”.

  15. Of course, the number of true positives in this case is calculated against all single words in the gold standard.

  16. The version of GenEx tool that we used cannot be applied to any language except English.
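The definition of degree centrality in note 4 can be sketched in a few lines of Python. This is only an illustration of the ranking criterion, not the authors' implementation: the function names, the arc-list representation, and the toy graph are assumptions, and how DegExt builds its graph from text is not described in these notes.

```python
from collections import Counter


def degree_centrality(arcs):
    """Map each node to in-degree + out-degree, given directed (source, target) arcs."""
    degree = Counter()
    for source, target in arcs:
        degree[source] += 1  # one outgoing arc for the source
        degree[target] += 1  # one incoming arc for the target
    return dict(degree)


def top_nodes(arcs, k):
    """Return the k highest-degree nodes; ties are broken alphabetically (an assumption)."""
    scores = degree_centrality(arcs)
    return sorted(scores, key=lambda node: (-scores[node], node))[:k]


# Hypothetical toy graph; node labels are for illustration only.
arcs = [
    ("graph", "based"),
    ("based", "keyphrase"),
    ("keyphrase", "extractor"),
    ("graph", "keyphrase"),
]

scores = degree_centrality(arcs)
ranking = top_nodes(arcs, 2)
```

Ranking by degree needs a single pass over the arcs, which is why note 2 describes it as a much lighter alternative to the iterative PageRank computation on undirected unweighted graphs.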

References

  • Adler M (2007) Hebrew morphological disambiguation: an unsupervised stochastic word-based approach. PhD Thesis, Ben Gurion University

  • Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30:107–117

  • Dermatas E, Kokkinakis G (1995) Automatic stochastic tagging of natural language texts. Comput Linguist 21:137–163

  • DUC (2002) Document understanding conference. http://duc.nist.go

  • Goldberg Y, Tsarfaty R, Adler M, Elhadad M (2009) Enhancing unlexicalized parsing performance using a wide coverage lexicon, fuzzy tag-set mapping, and EM-HMM-based lexical probabilities. In: Proceedings of the EACL 2009. Athens, Greece

  • Grineva M, Grinev M, Lizorkin D (2009) Extracting key terms from noisy and multitheme documents. In: Proceedings of the 18th international conference on World wide web, pp 661–670

  • Hulth A (2003) Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 conference on empirical methods in natural language processing. Japan

  • Lee J-W, Baik D-K (2004) A model for extracting keywords of document using term frequency and distribution. In: Proceedings of CICLING 2004

  • Li D, Li S, Li W, Wang W, Qu W (2010) A semi-supervised key phrase extraction approach: learning from title phrases through a document semantic network. In: Proceedings of the ACL 2010 conference short papers. Uppsala, Sweden, pp 296–300

  • Litvak M, Kisilevich S, Keim D, Lipman H, Gur AB, Last M (2010a) Towards language-independent summarization: a comparative analysis of sentence extraction methods on English and Hebrew corpora. In: Proceedings of the CLIA workshop (COLING 2010). Beijing, China, pp 20–30

  • Litvak M, Last M (2008) Graph-based keyword extraction for single-document summarization. In: Proceedings of the workshop on multi-source multilingual information extraction and summarization, pp 17–24

  • Litvak M, Last M, Aizenman H, Gobits I, Kandel A (2011) DegExt: a language-independent graph-based keyphrase extractor. In: Proceedings of the 7th Atlantic web intelligence conference (AWIC’11). Fribourg, Switzerland, pp 121–130

  • Litvak M, Menahem M, Last M (2010b) A new approach to improving multilingual summarization using a genetic algorithm. In: Proceedings of the 48th annual meeting of the association for computational linguistics. Uppsala, Sweden, pp 927–936

  • Luhn HP (1958) The automatic creation of literature abstracts. IBM J Res Dev 2:159–165

  • Mihalcea R, Tarau P (2004) TextRank: bringing order into texts. In: Proceedings of the conference on empirical methods in natural language processing. Barcelona, Spain

  • Moore J, Han E-H, Boley D, Gini M, Gross R, Hastings K, Karypis G, Kumar V, Mobasher B (1997) Web page categorization and feature selection using association rule and principal component clustering. In: 7th Workshop on information technologies and systems

  • Schenker A, Bunke H, Last M, Kandel A (2004) Classification of web documents using graph matching. Int J Pattern Recognit Artif Intell 18:475–496

  • Schenker A, Bunke H, Last M, Kandel A (2005) Graph-theoretic techniques for web content mining. World Scientific Pub Co Inc., Singapore

  • Toutanova K, Klein D, Manning C, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of HLT-NAACL 2003. Edmonton, Canada, pp 252–259

  • Turney PD (2000) Learning algorithms for keyphrase extraction. Inf Retr 2:303–336

  • Witten IH, Paynter GW, Frank E, Gutwin C, Nevill-Manning CG (1999) Kea: practical automatic keyphrase extraction. In: Proceedings of the fourth ACM conference on digital libraries. Berkeley, California, USA, pp 254–255

Acknowledgments

We are grateful to our project students Hen Aizenman and Inbal Gobits from Ben-Gurion University for providing the TextRank implementation.

Corresponding author

Correspondence to Marina Litvak.

Cite this article

Litvak, M., Last, M. & Kandel, A. DegExt: a language-independent keyphrase extractor. J Ambient Intell Human Comput 4, 377–387 (2013). https://doi.org/10.1007/s12652-012-0109-z

  • DOI: https://doi.org/10.1007/s12652-012-0109-z