DegExt: a language-independent keyphrase extractor

Litvak, Marina; Last, Mark; Kandel, Abraham

doi:10.1007/s12652-012-0109-z

Original Research
Published: 29 February 2012

Volume 4, pages 377–387, (2013)

Journal of Ambient Intelligence and Humanized Computing

Marina Litvak¹,
Mark Last² &
Abraham Kandel³

Abstract

In this paper, we introduce DegExt, a graph-based, language-independent keyphrase extractor that extends the keyword extraction method described in Litvak and Last (Graph-based keyword extraction for single-document summarization. In: Proceedings of the workshop on multi-source multilingual information extraction and summarization, pp 17–24, 2008). We compare DegExt with two state-of-the-art approaches to keyphrase extraction: GenEx (Turney in Inf Retr 2:303–336, 2000) and TextRank (Mihalcea and Tarau in TextRank: bringing order into texts. In: Proceedings of the conference on empirical methods in natural language processing. Barcelona, Spain, 2004). We evaluated DegExt on collections of benchmark summaries in two languages: English and Hebrew. Our experiments on the English corpus show that, when the extracted phrases are matched against the gold standard collection, DegExt significantly outperforms TextRank and GenEx in terms of precision and area under the curve for summaries of 15 keyphrases or more, at the expense of a mostly non-significant decrease in recall and F-measure. Because DegExt tends to extract longer phrases than GenEx and TextRank, it outperforms both in terms of recall and F-measure when single extracted words are considered instead. On the Hebrew corpus, DegExt performs the same as TextRank regardless of the number of keyphrases. An additional experiment shows that DegExt applied to the TextRank representation graphs outperforms the other systems in the text classification task. For documents in both languages, DegExt surpasses both GenEx and TextRank in terms of implementation simplicity and computational complexity.

Notes

Best results shown for N = 2.
Based on this fact, every application of PageRank to undirected unweighted graphs may be replaced by a much lighter approach based on degree centrality ranking.
This part may be skipped for multilingual processing unless appropriate stemmers and stopword lists are provided for different languages.
Degree centrality assigns importance to a node in proportion to its degree, that is, the number of edges incident to it. In a directed graph we can distinguish between the in-degree (the number of incoming arcs) and the out-degree (the number of outgoing arcs). In this case, the degree of a node is the sum of its in-degree and its out-degree.
Since we limit our phrases to at most three words, the length of a keyphrase cannot exceed three.
http://www.haaretz.co.il.
Dataset is available at http://www.cs.bgu.ac.il/~litvakm/research/.
http://nlp.stanford.edu/software/tagger.shtml.
http://www.cs.bgu.ac.il/~adlerm/freespace/tagger.zip.
No standard stopword list for Hebrew exists.
We define the size of a graph as the number of its vertices.
Since GenEx is a commercial, language-specific tool (we have the English version), we were unable to adapt its API to Hebrew, whereas the DegExt and TextRank implementations were adapted to both languages.
We used macro-averaging in our calculations.
In our corpus it appears under the name “doc1.txt”.
Of course, the true positives in this case are calculated against all single words in the gold standard.
The version of the GenEx tool that we used cannot be applied to any language except English.
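
The degree centrality note above can be illustrated with a minimal sketch. This is not the DegExt implementation: the graph construction (one directed edge per distinct pair of adjacent tokens), the function names build_word_graph and degree_ranking, and the sample sentence are all assumptions made for illustration. The sketch only shows how nodes of a directed word graph can be ranked by the sum of their in-degree and out-degree.

```python
from collections import defaultdict


def build_word_graph(tokens):
    """Build a directed graph with an edge u -> v for each distinct
    pair of adjacent tokens (u, v). Hypothetical construction."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for u, v in zip(tokens, tokens[1:]):
        if u != v:  # ignore self-loops
            successors[u].add(v)
            predecessors[v].add(u)
    return successors, predecessors


def degree_ranking(tokens):
    """Rank words by degree = in-degree + out-degree, highest first.
    Ties are broken alphabetically so the order is deterministic."""
    successors, predecessors = build_word_graph(tokens)
    degree = {
        word: len(successors.get(word, ())) + len(predecessors.get(word, ()))
        for word in set(tokens)
    }
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


# Illustrative input only; real use would tokenize, stem and filter
# stopwords first, where such resources exist for the language.
sample = "graph based keyword extraction uses graph based ranking of keyword nodes"
top_two = degree_ranking(sample.split())[:2]
print(top_two)  # [('keyword', 4), ('based', 3)]
```

Ranking by degree needs a single pass over the edges, whereas PageRank iterates until convergence, which is the sense in which the degree-based approach is lighter.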

References

Adler M (2007) Hebrew morphological disambiguation: an unsupervised stochastic word-based approach. PhD Thesis, Ben Gurion University
Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30:107–117
Dermatas E, Kokkinakis G (1995) Automatic stochastic tagging of natural language texts. Comput Linguist 21:137–163
DUC (2002) Document understanding conference. http://duc.nist.go
Goldberg Y, Tsarfaty R, Adler M, Elhadad M (2009) Enhancing unlexicalized parsing performance using a wide coverage lexicon, fuzzy tag-set mapping, and EM-HMM-based lexical probabilities. In: Proceedings of the EACL 2009. Athens, Greece
Grineva M, Grinev M, Lizorkin D (2009) Extracting key terms from noisy and multitheme documents. In: Proceedings of the 18th international conference on World wide web, pp 661–670
Hulth A (2003) Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 conference on empirical methods in natural language processing. Japan
Lee J-W, Baik D-K (2004) A model for extracting keywords of document using term frequency and distribution. In: Proceedings of CICLING 2004
Li D, Li S, Li W, Wang W, Qu W (2010) A semi-supervised key phrase extraction approach: learning from title phrases through a document semantic network. In: Proceedings of the ACL 2010 conference short papers. Uppsala, Sweden, pp 296–300
Litvak M, Kisilevich S, Keim D, Lipman H, Gur AB, Last M (2010a) Towards language-independent summarization: a comparative analysis of sentence extraction methods on English and Hebrew corpora. In: Proceedings of the CLIA workshop (COLING 2010). Beijing, China, pp 20–30
Litvak M, Last M (2008) Graph-based keyword extraction for single-document summarization. In: Proceedings of the workshop on multi-source multilingual information extraction and summarization, pp 17–24
Litvak M, Last M, Aizenman H, Gobits I, Kandel A (2011) DegExt: a language-independent graph-based keyphrase extractor. In: Proceedings of the 7th Atlantic web intelligence conference (AWIC’11). Fribourg, Switzerland, pp 121–130
Litvak M, Menahem M, Last M (2010b) A new approach to improving multilingual summarization using a genetic algorithm. In: Proceedings of the 48th annual meeting of the association for computational linguistics. Uppsala, Sweden, pp 927–936
Luhn HP (1958) The automatic creation of literature abstracts. IBM J Res Dev 2:159–165
Mihalcea R, Tarau P (2004) TextRank: bringing order into texts. In: Proceedings of the conference on empirical methods in natural language processing. Barcelona, Spain
Moore J, Han E-H, Boley D, Gini M, Gross R, Hastings K, Karypis G, Kumar V, Mobashe B (1997) Web page categorization and feature selection using association rule and principal component clustering. In: 7th Workshop on information technologies and systems
Schenker A, Bunke H, Last M, Kandel A (2004) Classification of web documents using graph matching. Int J Pattern Recognit Artif Intell 18:475–496
Schenker A, Bunke H, Last M, Kandel A (2005) Graph-theoretic techniques for web content mining. World Scientific Pub Co Inc., Singapore
Toutanova K, Klein D, Manning C, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of HLT-NAACL 2003. Edmonton, Canada, pp 252–259
Turney PD (2000) Learning algorithms for keyphrase extraction. Inf Retr 2:303–336
Witten IH, Paynter GW, Frank E, Gutwin C, Nevill-Manning CG (1999) Kea: practical automatic keyphrase extraction. In: Proceedings of the fourth ACM conference on digital libraries. Berkeley, California, USA, pp 254–255

Acknowledgments

We are grateful to our project students Hen Aizenman and Inbal Gobits from Ben-Gurion University for providing the TextRank implementation.

Author information

Authors and Affiliations

Department of Software Engineering, Sami Shamoon Academic College of Engineering, 84100, Beer-Sheva, Israel
Marina Litvak
Department of Information System Engineering, Ben-Gurion University of the Negev, 84105, Beer-Sheva, Israel
Mark Last
Department of Computer Science and Engineering, University of South Florida, Tampa, FL, 33620, USA
Abraham Kandel

Corresponding author

Correspondence to Marina Litvak.

About this article

Cite this article

Litvak, M., Last, M. & Kandel, A. DegExt: a language-independent keyphrase extractor. J Ambient Intell Human Comput 4, 377–387 (2013). https://doi.org/10.1007/s12652-012-0109-z

Received: 02 May 2011
Accepted: 14 February 2012
Published: 29 February 2012
Issue Date: June 2013
DOI: https://doi.org/10.1007/s12652-012-0109-z
