Abstract
In this paper, we introduce DegExt, a graph-based, language-independent keyphrase extractor that extends the keyword extraction method described in Litvak and Last (2008). We compare DegExt with two state-of-the-art approaches to keyphrase extraction: GenEx (Turney 2000) and TextRank (Mihalcea and Tarau 2004). We evaluated DegExt on collections of benchmark summaries in two languages: English and Hebrew. Our experiments on the English corpus show that, when the extracted phrases are matched against the gold-standard collection, DegExt significantly outperforms TextRank and GenEx in terms of precision and area under the curve for summaries of 15 keyphrases or more, at the expense of a mostly non-significant decrease in recall and F-measure. Because DegExt tends to extract longer phrases than GenEx and TextRank, it outperforms both in terms of recall and F-measure when single extracted words are considered. On the Hebrew corpus, DegExt performs the same as TextRank regardless of the number of keyphrases. An additional experiment shows that DegExt applied to the TextRank representation graphs outperforms the other systems in the text classification task. For documents in both languages, DegExt surpasses both GenEx and TextRank in terms of implementation simplicity and computational complexity.
Notes
Best results shown for N = 2.
Based on this fact, every application of PageRank to undirected unweighted graphs may be replaced by a much lighter approach based on degree centrality ranking.
This part may be skipped for multilingual processing unless appropriate stemmers and stopword lists are provided for different languages.
Degree centrality assigns importance to a node in proportion to its degree, that is, the number of edges incident to it. In a directed graph, we can distinguish between the in-degree (the number of incoming arcs) and the out-degree (the number of outgoing arcs). In this case, the degree of a node is the sum of its in-degree and its out-degree.
Since we limit our phrases to three words at most, the maximal length of a keyphrase cannot exceed three words.
The dataset is available at http://www.cs.bgu.ac.il/~litvakm/research/.
No standard stopword list for Hebrew exists.
We define the size of a graph as the number of its vertices.
Since GenEx is a commercial, language-specific tool (we have the English version), we were unable to adapt its API to Hebrew, whereas the DegExt and TextRank implementations were adapted to both languages.
We used macro-averaging in our calculations.
In our corpus, it appears under the name “doc1.txt”.
Of course, the number of true positives in this case is calculated against all single words in the gold standard.
The version of the GenEx tool that we used cannot be applied to any language except English.
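The notes on degree centrality above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (degree_centrality, top_n, degree_is_stationary) and the toy word graph are hypothetical, chosen only for illustration. The sketch computes the degree of each node as in-degree plus out-degree, ranks nodes by degree, and checks that on an undirected unweighted graph the distribution proportional to degree is stationary for a plain random walk. That check covers only the undamped walk; with a damping factor, as in standard PageRank, the correspondence need not be exact.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Return {node: degree}, where degree = in-degree + out-degree.

    For an undirected graph given as a list of (u, v) pairs, this sum
    equals the ordinary degree (number of incident edges).
    """
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(in_deg) | set(out_deg)
    return {n: in_deg[n] + out_deg[n] for n in nodes}

def top_n(scores, n):
    """Return the n highest-scoring nodes; ties are broken alphabetically."""
    return sorted(scores, key=lambda node: (-scores[node], node))[:n]

def degree_is_stationary(edges, tol=1e-12):
    """Check that pi(v) = deg(v) / sum(deg) is unchanged by one step of an
    undamped random walk on the undirected graph defined by `edges`."""
    deg = degree_centrality(edges)
    total = sum(deg.values())
    pi = {n: d / total for n, d in deg.items()}
    nxt = defaultdict(float)
    for u, v in edges:
        # Each undirected edge carries probability mass in both directions.
        nxt[v] += pi[u] / deg[u]
        nxt[u] += pi[v] / deg[v]
    return all(abs(nxt[n] - pi[n]) <= tol for n in pi)

# Hypothetical toy word graph (illustrative only, not from the paper's corpus).
edges = [
    ("graph", "based"),
    ("based", "keyphrase"),
    ("keyphrase", "extraction"),
    ("graph", "extraction"),
    ("graph", "keyphrase"),
]
scores = degree_centrality(edges)
ranked = top_n(scores, 2)
```

Here ranked holds the two highest-degree words of the toy graph. Computing degrees takes a single pass over the edge list, which is the sense in which degree ranking is lighter than an iterative PageRank computation.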
References
Adler M (2007) Hebrew morphological disambiguation: an unsupervised stochastic word-based approach. PhD Thesis, Ben Gurion University
Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30:107–117
Dermatas E, Kokkinakis G (1995) Automatic stochastic tagging of natural language texts. Comput Linguist 21:137–163
DUC (2002) Document understanding conference. http://duc.nist.go
Goldberg Y, Tsarfaty R, Adler M, Elhadad M (2009) Enhancing unlexicalized parsing performance using a wide coverage lexicon, fuzzy tag-set mapping, and EM-HMM-based lexical probabilities. In: Proceedings of the EACL 2009. Athens, Greece
Grineva M, Grinev M, Lizorkin D (2009) Extracting key terms from noisy and multitheme documents. In: Proceedings of the 18th international conference on World wide web, pp 661–670
Hulth A (2003) Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 conference on empirical methods in natural language processing. Japan
Lee J-W, Baik D-K (2004) A model for extracting keywords of document using term frequency and distribution. In: Proceedings of CICLING 2004
Li D, Li S, Li W, Wang W, Qu W (2010) A semi-supervised key phrase extraction approach: learning from title phrases through a document semantic network. In: Proceedings of the ACL 2010 conference short papers. Uppsala, Sweden, pp 296–300
Litvak M, Kisilevich S, Keim D, Lipman H, Gur AB, Last M (2010a) Towards language-independent summarization: a comparative analysis of sentence extraction methods on English and Hebrew corpora. In: Proceedings of the CLIA workshop (COLING 2010). Beijing, China, pp 20–30
Litvak M, Last M (2008) Graph-based keyword extraction for single-document summarization. In: Proceedings of the workshop on multi-source multilingual information extraction and summarization, pp 17–24
Litvak M, Last M, Aizenman H, Gobits I, Kandel A (2011) Degext—a language-independent graph-based keyphrase extractor. In: Proceedings of the 7th Atlantic web intelligence conference (AWIC’11). Fribourg, Switzerland, pp 121–130
Litvak M, Menahem M, Last M (2010b) A new approach to improving multilingual summarization using a genetic algorithm. In: Proceedings of the 48th annual meeting of the association for computational linguistics. Uppsala, Sweden, pp 927–936
Luhn HP (1958) The automatic creation of literature abstracts. IBM J Res Dev 2:159–165
Mihalcea R, Tarau P (2004) Textrank—bringing order into texts. In: Proceedings of the conference on empirical methods in natural language processing. Barcelona, Spain
Moore J, Han E-H, Boley D, Gini M, Gross R, Hastings K, Karypis G, Kumar V, Mobashe B (1997) Web page categorization and feature selection using association rule and principal component clustering. In: 7th Workshop on information technologies and systems
Schenker A, Bunke H, Last M, Kandel A (2004) Classification of web documents using graph matching. Int J Pattern Recognit Artif Intell 18:475–496
Schenker A, Bunke H, Last M, Kandel A (2005) Graph-theoretic techniques for web content mining. World Scientific Pub Co Inc., Singapore
Toutanova K, Klein D, Manning C, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of HLT-NAACL 2003. Edmonton, Canada, pp 252–259
Turney PD (2000) Learning algorithms for keyphrase extraction. Inf Retr 2:303–336
Witten IH, Paynter GW, Frank E, Gutwin C, Nevill-Manning CG (1999) Kea: practical automatic keyphrase extraction. In: Proceedings of the fourth ACM conference on digital libraries. Berkeley, California, USA, pp 254–255
Acknowledgments
We are grateful to our project students Hen Aizenman and Inbal Gobits from Ben-Gurion University for providing the TextRank implementation.
Cite this article
Litvak, M., Last, M. & Kandel, A. DegExt: a language-independent keyphrase extractor. J Ambient Intell Human Comput 4, 377–387 (2013). https://doi.org/10.1007/s12652-012-0109-z
DOI: https://doi.org/10.1007/s12652-012-0109-z