1 Introduction

Document summaries should use a minimum number of words to express a document's main ideas. As such, high-quality summaries can significantly reduce the constant information overload experienced by professionals in a variety of fields, assist in the automated classification and filtering of documents, and increase search engine precision.

Automated summarization methods can be categorized as either statistics-based, using the classic vector space model or graph representations, or semantics-based, employing ontologies and language-specific knowledge. Both categories can include supervised (corpus-based) machine-learning techniques as well as unsupervised approaches. Automated summarization methods can also use different levels of linguistic analysis: morphological, syntactic, semantic, and discourse/pragmatic (Mani 2001).

Although summary quality is expected to improve when a summarization technique incorporates language-specific knowledge, the dependence on such knowledge impedes the use of the same summarizer for multiple languages. On the other hand, the publication of information on the Internet in an ever-increasing variety of languages underscores the importance of developing multilingual summarization approaches. Thus, there is a particular need for language-independent statistical techniques that can be readily applied to texts in any language without depending on language-specific linguistic tools.

This work focuses on multilingual summarization: we evaluate two different approaches to the cross-lingual training of MUSE (Litvak et al. 2010b), a supervised algorithm for single-document summarization. MUSE treats summarization as an optimization problem and applies a Genetic Algorithm (GA) to find an optimal weighted linear combination of 31 statistical sentence features, all of which are language-independent. In other words, the methodology described in Litvak et al. (2010b) finds a sentence ranking model, a linear combination of sentence features, by applying a Genetic Algorithm. The induced model can then be applied to summarize documents in the same or a different language/genre.

We performed our evaluation experiments on three monolingual corpora of English, Hebrew, and Arabic documents, and on six parallel corpora obtained by the pairwise machine translation of each corpus from its original language into the other two. The experiments were aimed at evaluating our approach in both monolingual and multilingual settings, as well as comparing it to state-of-the-art summarization methods and optimization approaches.

This article is organized as follows. The next section describes related work on extractive summarization. Section 3 describes MUSE, our GA-based approach to multilingual single-document extractive summarization, and possible scenarios for cross-lingual training. Section 4 presents our experimental results for the multilingual summarization task and the cross-lingual training of MUSE. Our conclusions and suggestions for future work comprise the final section.

2 Related work

Extractive summarization selects a subset of the most relevant fragments of a source text to form the summary. The fragments can be paragraphs (Salton et al. 1997), sentences (Luhn 1958), keyphrases (Turney 2000; Litvak et al. 2011) or keywords (Litvak and Last 2008). Extractive summarization usually consists of ranking, where each fragment of the summarized text receives a relevance score, and extraction, where the top-ranked fragments are assembled into a summary according to their order of appearance in the original text. Statistical methods for calculating the relevance score of each fragment fall into several categories: cue-based (Edmundson 1969), keyword- or frequency-based (Luhn 1958; Edmundson 1969; Neto et al. 2000; Steinberger and Jezek 2004; Kallel et al. 2004; Vanderwende et al. 2007), title-based (Edmundson 1969; Teufel and Moens 1997), position-based (Baxendale 1958; Edmundson 1969; Lin and Hovy 1997; Nobata et al. 2001) and length-based (Nobata et al. 2001). In our approach, we use 31 language-independent sentence features from various categories.

Considered the first work on sentence scoring for automated text summarization, the seminal paper by Luhn (1958) based the significance factor of a sentence on the frequency and relative positions of significant words within the sentence. Edmundson (1969) tested different linear combinations of four scoring features for ranking sentences—cue, key, title and position—to identify the one with the best performance on a training corpus. Linear combinations of several statistical sentence ranking features were also applied in the MEAD (Radev et al. 2001) and SUMMA (Saggion et al. 2003) approaches, both of which use the vector space model for text representation and a set of predefined or user-specified weights for a combination of position, frequency, title, and centroid-based (MEAD) features. Goldstein et al. (1999) integrated linguistic and statistical features. In none of these works, however, did the researchers attempt to find the optimal weights for the best linear combination. Later attempts to find the best combination used machine learning techniques, as in Wong et al. (2008), where supervised and semi-supervised learning approaches were applied to various sentence features: different groups of features from four categories were manually constructed and evaluated, and, finally, 14 features from three categories were found to form the best combination. In our work, we continue these attempts through the supervised learning of the best linear combination of 31 sentence features from a training set of annotated documents. Our approach applies a global search technique to the full set of features and does not require constructing different feature combinations manually.

Some authors have reduced the summarization process to an optimization or search problem. Hassel and Sjobergh (2006) used a standard hill-climbing algorithm to build summaries that maximize a score for the total impact of the summary. A summary consisting of the first sentences of the document was used as a starting point for the search, and all neighbors (summaries that can be created by simply removing one sentence and adding another) were examined in the search for a better summary. Aker et al. (2010) used the A* search algorithm to find the best extractive summary up to a given length, in a way that is both optimal and computationally efficient. Ouyang et al. (2011) applied regression models to query-focused multi-document summarization, using Support Vector Regression (SVR) to estimate the importance of a sentence in a document set through a set of pre-defined features. In our work, we use a Genetic Algorithm (GA), a prominent search and optimization method (Goldberg 1989), to optimize a linear combination of multiple sentence features.

Alfonseca and Rodriguez (2003), Kallel et al. (2004) and Liu et al. (2006b) used GAs to find sets of sentences that maximize summary quality metrics, starting from a random selection of sentences as the initial population. In this setting, however, the high computational complexity of GAs is a disadvantage: to choose the best summary, multiple candidates must be generated and evaluated for each document (or document cluster). Following a different approach, Turney (2000) used a GA to learn an optimized set of parameters for a keyword extractor embedded in the Extractor tool. Orăsan et al. (2000) enhanced preference-based anaphora resolution algorithms by using a GA to find an optimal set of values for the outcomes of 14 indicators and applied the optimal combination of values obtained from one text to a different text. With such an approach, training may be the only time-consuming phase in the process. A detailed description of our approach to using a GA for optimizing the sentence feature weights can be found in the next section.

All corpus-based approaches share a common problem: they need to be retrained for each new language and genre. However, preparing annotated corpora for multiple languages is a very labor-intensive and time-consuming process, especially for rare languages. Nowadays, parallel corpora are widely used in different areas of information retrieval and computational linguistics, including cross-lingual summarization. Wan et al. (2010) adopted the Late Translation (LateTrans) strategy, using machine translation, for cross-lingual summarization. They evaluated the translation quality of each sentence in the English-to-Chinese summarization of a given document or document set, and the English sentences with high translation quality and high informativeness were then selected and translated to form the Chinese summary. In this article we show empirically that using parallel corpora can be helpful for training corpus-based summarization techniques when no training data for a new language is available.

Various text representation models have been utilized for summarized documents across different approaches. Today, graphs are becoming increasingly popular due to their ability to enrich the document model with syntactic and semantic relations. Erkan and Radev (2004) and Mihalcea (2005) introduced LexRank and TextRank, respectively: algorithms for unsupervised extractive summarization that rely on iterative graph-based ranking algorithms, such as PageRank (Brin and Page 1998) and HITS (Kleinberg 1999). Their methods represent a document as a graph of sentences interconnected by similarity relations. Various similarity functions can be applied: cosine similarity as in LexRank (Erkan and Radev 2004), simple overlap as in TextRank (Mihalcea 2005), or others. The edges representing similarity relations can be weighted (Mihalcea 2005) or unweighted (Erkan and Radev 2004), where two sentences are connected if their similarity is above some predefined threshold value. Wan (2008) applied graph-based ranking algorithms that exploit each kind of sentence relationship to generic multi-document summarization, and integrated the relevance of the sentences to a specified topic into the graph-based ranking method for topic-focused multi-document summarization. In MUSE, we use two graph-based models, based on sentence and word segmentation, respectively.

It is worth noting that our work is aimed at generic summarization, which represents the author's point of view, as opposed to query-based summarization, which focuses on material of interest to the user (Hovy 2001). While in generic summarization the only input to the system is the document (or documents) to summarize, in query-based summarization a query expressing the user's interest must also be provided, and the resulting summary must contain the information relevant to that query.

3 MUSE: MUltilingual Sentence Extractor

MUltilingual Sentence Extractor is a supervised learning approach to language-independent extractive summarization, where the best set of weights for a linear combination of sentence scoring methods is found by a genetic algorithm trained on a collection of document summaries (see Algorithm 1). Formally, the model for sentence scoring can be expressed by the following formula:

$$ Score = \sum_i w_i \times r_i $$

where \(r_i\) is the value of the ith sentence feature (one of the 31 described below) and \(w_i\) is its weight in the linear combination.
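For illustration only, the following sketch computes this score for two toy sentences; the feature names, feature values, and weights below are placeholders rather than the actual 31 MUSE features and learned weights.

```python
# Minimal sketch of the scoring model Score = sum_i(w_i * r_i). All names and
# numbers are illustrative placeholders, not MUSE's actual features or weights.

def score_sentence(feature_values, weights):
    """Return the weighted linear combination of per-sentence feature values."""
    assert len(feature_values) == len(weights)
    return sum(w * r for w, r in zip(weights, feature_values))

weights = [0.5, 0.2, 0.3]  # in MUSE, these are learned by the Genetic Algorithm
sentences = {
    "First sentence of the article.": [1.0, 0.4, 0.7],
    "A minor detail near the end.":   [0.1, 0.3, 0.2],
}
for text, features in sentences.items():
    print(f"{score_sentence(features, weights):.2f}  {text}")
```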

Algorithm 1 Step 1: Training

The weighting vector thus obtained is to be used for sentence scoring in future summarizations. The sentences with the highest score are then selected for the summary, according to the greedy approach presented in Algorithm 2.

Algorithm 2 Step 2: Summarizing a new document

Since most sentence scoring methods have a linear computational complexity, only the training phase of our approach is time-consuming.

Figure 1 depicts the flowchart of the proposed approach. It consists of two main modules: the training module activated offline, and the summarization module operating online. Both modules utilize three different representations of documents: one vector-based and two graph-based (see Sect. 3.2). The preprocessing sub-module is responsible for constructing each representation, and it is embedded in both modules. Algorithms 1 and 2 contain the pseudo-code for two independent phases of MUSE: training and summarization, respectively.

Fig. 1 MUSE summarization flowchart

The training module receives as input a corpus of documents, each accompanied by one or several gold-standard summaries—abstracts or extracts—compiled by human assessors. The set of documents may be either monolingual or multilingual, and the summaries have to be in the same language as the original text. As a second parameter, the module obtains a user-specified set of sentence features computed by the system. The training module then applies a genetic algorithm to sentence-feature matrices of precomputed sentence scores for each input feature, with the purpose of finding the best linear combination of features using ROUGE as a fitness function. The output of the training module is a vector of weights for the user-specified sentence ranking features.

The summarization module performs on-line summarization of input texts. Each sentence of an input text document obtains a relevance score according to the trained model, and the top-ranked sentences are extracted to the summary in their original order. To avoid duplicate content, a new sentence is added if and only if it is not similar to the previously selected sentences; in our experiments, we used the cosine similarity measure with a threshold of 0.8. The length of the resulting summaries is limited by a user-specified value (a maximum number of words or sentences in the extract, or a maximum extract-to-text ratio). The summarization module is expected to use a model trained on the same language as the input texts. However, if such a model is not available (there is no annotated corpus in the text language), the user can choose one of the following: (1) use a model trained on some other language/corpus (in Sect. 4 we explore whether the same model can be used effectively across different languages), or (2) train a model on a parallel corpus generated by a machine translation tool.
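The sketch below illustrates this greedy extraction step under simplifying assumptions: whitespace tokenization, a stand-in list of sentence scores, and the cosine threshold of 0.8 mentioned above.

```python
# Sketch of the greedy extraction step: visit sentences in decreasing score
# order, skip any candidate whose cosine similarity to an already selected
# sentence exceeds the 0.8 threshold, and output the result in original order.
# Whitespace tokenization is a deliberate simplification.
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def greedy_extract(sentences, scores, max_words, sim_threshold=0.8):
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, words = [], 0
    for i in ranked:
        tokens = sentences[i].lower().split()
        if words + len(tokens) > max_words:
            continue
        if any(cosine(tokens, sentences[j].lower().split()) > sim_threshold
               for j in chosen):
            continue  # too similar to a previously selected sentence
        chosen.append(i)
        words += len(tokens)
    return [sentences[i] for i in sorted(chosen)]  # restore original order

print(greedy_extract(
    ["The cat sat on the mat.", "The cat sat on the mat today.", "Dogs bark."],
    [0.9, 0.8, 0.5], max_words=12))
```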

The preprocessing submodule performs the following tasks: (1) sentence segmentation, (2) word segmentation, (3) stopword removal, (4) vector space model construction using tf and/or tf-idf weights, (5) word-based graph representation construction, (6) sentence-based graph representation construction, and (7) document metadata extraction. The outputs of this submodule are the sentence-segmented text, the vector space model, and the document graphs. Both modules, training and summarization, use all three representations for the calculation of sentence features. It is worth noting that proper sentence segmentation is crucial for the quality of extractive summarization results. Since sentence and word segmentation are language-dependent, these parts should be integrated and configured for each language by the end user of our system. So far, we have used the sentence splitter provided with the MEAD summarizer (Radev et al. 2001) for English, and a simple splitter that splits the text at periods, exclamation points, or question marks for the Hebrew and Arabic texts. In the future we intend to utilize a fully language-independent technique for text segmentation based on n-grams.
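A rough approximation of the simple splitter described above might look as follows; it is a sketch only, since real-world text (abbreviations, ellipses) needs more careful handling.

```python
# Rough approximation of the simple splitter used for Hebrew and Arabic:
# split at periods, exclamation points, or question marks followed by
# whitespace. Abbreviations and ellipses are deliberately not handled.
import re

def simple_sentence_split(text):
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(simple_sentence_split("First sentence. Second one! A third?"))
```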

3.1 Language-independent sentence scoring features

MUltilingual Sentence Extractor is aimed at identifying the best linear combination of language-independent sentence scoring features. Table 1 shows the complete list of the 31 sentence features used in this article. Each feature description includes a reference to the original work where the method was proposed for extractive summarization; several of the methods were proposed by us in Litvak et al. (2010b). The formulas use the following notation: a sentence is denoted by S, a text document by D, the total number of words in S by N, the total number of sentences in D by n, the sequential number of S in D by i, and the in-document term frequency of the term t by tf(t). In the LUHN method, \(W_i\) and \(N_i\) are the number of keywords and the total number of words in the ith cluster, respectively, where clusters are sentence portions bracketed by keywords, i.e., frequent, non-common words.
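To make the LUHN notation concrete, the sketch below scores a sentence by the maximum of \(W_i^2/N_i\) over its clusters, following Luhn's original significance factor; the keyword set and the maximum allowed gap between keywords are illustrative assumptions, since those parameters are not fixed in the text above.

```python
# Hedged sketch of the LUHN feature: a cluster is a sentence span bracketed by
# keywords, scored as W_i**2 / N_i, where W_i counts keywords and N_i counts
# all words in the span. The keyword set and max_gap here are assumptions.

def luhn_score(sentence_tokens, keywords, max_gap=4):
    positions = [i for i, t in enumerate(sentence_tokens) if t in keywords]
    if not positions:
        return 0.0
    best, start = 0.0, positions[0]
    prev, count = positions[0], 1
    for pos in positions[1:]:
        if pos - prev <= max_gap:      # keyword close enough: same cluster
            prev, count = pos, count + 1
        else:                          # close the cluster, start a new one
            best = max(best, count ** 2 / (prev - start + 1))
            start, prev, count = pos, pos, 1
    best = max(best, count ** 2 / (prev - start + 1))
    return best

tokens = "the new model improves summary quality on news data".split()
print(luhn_score(tokens, keywords={"model", "summary", "quality"}))  # 2.25
```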

Table 1 Sentence scoring features (Litvak et al. 2010b)

Due to the multilingual focus of our work, exact word matching was used in all similarity-based methods. For the same reason, we kept two kinds of length features: the number of words and the number of characters in the sentence.

Figure 2 presents the taxonomy of the 31 features listed in Table 1. The features are divided into three main categories—structure-, vector-, and graph-based—according to the type of text representation they use, and each category is divided into sub-categories according to the main calculation criteria. For example, the "graph-based" category contains all features that use the graph representations (word- and sentence-based), and its "pagerank" sub-category combines features based on the eigenvector centrality of nodes (standing for words or sentences). Features that require pre-defined threshold values are marked with a cross and listed in Table 2, together with the average threshold values obtained after evaluating the methods on the English, Hebrew, and Arabic corpora. Each feature was evaluated on the three corpora with different thresholds \(t \in [0, 1]\) (only values with one decimal digit were considered). The threshold values that resulted in the best ROUGE scores across the three corpora, as a result of training on the entire corpus, were selected. A threshold of 1 means that all terms are considered, while a value of 0 means that only terms with the highest absolute score of tf, degree, or pagerank (depending on the feature) are considered.

Fig. 2 Taxonomy of language-independent sentence scoring features (Litvak et al. 2010b)

Table 2 Selected thresholds for threshold-based scoring methods (Litvak et al. 2010a)

Section 3.3 describes our application of a Genetic Algorithm to the summarization task.

3.2 Text representation models

The vector-based scoring methods listed in Table 1 use tf or tf-idf term weights to evaluate sentence importance. In contrast, the graph-based methods (all except TextRank) rely on the word-based graph representation models described in Schenker et al. (2004). Schenker et al. (2005) showed that such graph representations can outperform the vector space model on several text mining tasks. In the word-based graph representation used in our work, nodes represent unique terms (distinct words) and edges represent order relationships between two terms: there is a directed edge from A to B if an occurrence of A immediately precedes an occurrence of B in any sentence of the document. Unlike Schenker et al. (2005), we label each edge with the IDs of the sentences that contain both words in the specified order. For the TextRank score calculation (denoted by ML_TR in Table 1), we build a sentence-based graph representation where nodes stand for sentences and edges for similarity relationships.
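A minimal sketch of this word-based graph construction follows, with each edge labeled by the IDs of the sentences in which the word pair occurs; tokenization is simplified to whitespace splitting.

```python
# Sketch of the word-based graph representation: nodes are unique terms, and a
# directed edge A -> B exists if A immediately precedes B in some sentence.
# Following the article, each edge is labeled with the IDs of the sentences
# in which the word pair occurs in that order.
from collections import defaultdict

def build_word_graph(sentences):
    edges = defaultdict(set)  # (term_a, term_b) -> set of sentence IDs
    for sid, sentence in enumerate(sentences):
        tokens = sentence.lower().split()
        for a, b in zip(tokens, tokens[1:]):
            edges[(a, b)].add(sid)
    return edges

graph = build_word_graph(["the cat sat", "the cat ran", "a dog sat"])
print(graph[("the", "cat")])  # {0, 1}
```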

3.3 Optimization: learning the best linear combination

We found the best linear combination of the features listed in Table 1 using a Genetic Algorithm (GA). GAs are categorized as global search heuristics. Figure 3 (Litvak et al. 2010b) shows a simplified flowchart of a Genetic Algorithm.

Fig. 3 GA flowchart

A typical genetic algorithm requires (1) a genetic representation of the solution domain, (2) a fitness function to evaluate the solution domain, and (3) selection and reproduction rules.

We represent each solution as a vector of weights for a linear combination of sentence scoring features: real-valued numbers in an unlimited range, normalized so that they sum up to 1. The vector size is fixed and equals the number of features used in the combination.

Defined over the genetic representation, the fitness function measures the quality of the represented solution. We use ROUGE-1 and ROUGE-2 Recall (Lin and Hovy 2003) as fitness functions for measuring summarization quality, i.e., similarity to the gold standard summaries, which should be maximized during training (the optimization procedure). As a training set, we use an annotated corpus of summarized documents, where each document is accompanied by several human-generated summaries, either abstracts or extracts.

Each phase of the optimization procedure is described in detail below.

Initialization:

A GA explores only a small part of the search space if the population is too small, whereas it slows down if there are too many solutions. We start with N = 500 randomly generated genes/solutions as the initial population, a size that empirically proved to be a good choice in our experiments. Each gene is represented by a weighting vector \(v_i = (w_1, \ldots, w_D)\) with a fixed number D of elements, equal to the number of sentence features used in the linear combination. All elements are generated from a standard normal distribution, with \(\mu = 0\) and \(\sigma^2 = 1\), and normalized to sum up to 1. Under this representation, a negative weight, if it occurs, can be considered a "penalty" for the associated feature.
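A minimal sketch of this initialization step (the unlikely case of a near-zero normalizing sum is ignored for brevity):

```python
# Sketch of GA initialization: N = 500 weighting vectors of dimension D, drawn
# from a standard normal distribution and normalized to sum to 1. Note that
# normalization preserves negative weights ("penalties").
import random

def init_population(n=500, d=31, seed=0):
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        total = sum(v)
        population.append([w / total for w in v])
    return population

population = init_population()
print(len(population), round(sum(population[0]), 6))  # -> 500 1.0
```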

Selection:

During each successive generation, a proportion of the existing population is selected to breed a new generation. We use a truncation selection method that rates the fitness of each solution and selects the best fifth (100 out of 500) of the individual solutions, i.e., those achieving the maximal ROUGE values. In this manner, we discard "bad" solutions and prevent them from reproducing. In addition, we use elitism, a method that prevents losing the best solution found so far by copying it to the next generation.
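The truncation selection and elitism described above can be sketched as follows; the fitness function is passed in as a parameter (in MUSE it is ROUGE against the gold standard summaries).

```python
# Sketch of truncation selection with elitism: keep the best fifth of the
# population by fitness and remember the single best solution unchanged.
def select(population, fitness, keep_ratio=0.2):
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: max(1, int(len(ranked) * keep_ratio))]
    elite = ranked[0]  # elitism: the best solution is carried over as-is
    return survivors, elite
```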

Reproduction:

At this stage, new genes/solutions are introduced into the population, i.e., new points in the search space are explored. These new solutions are generated from those selected through the following genetic operators: mating, crossover, and mutation.

In mating, a pair of “parent” solutions is randomly selected, and a new solution (child) is created using crossover and mutation, which are the most important parts of a genetic algorithm. The GA performance is influenced mainly by these two operators. New parents are selected for each new child, and the process continues until a new population of solutions of appropriate size N is generated.

Crossover is performed under the assumption that new solutions can be improved by re-using the good parts of old solutions. However, it is also beneficial to keep and transfer some part of the population from one generation to the next. Our crossover operator includes a probability (80 %) that a new and different offspring solution is generated by calculating the weighted average of two "parent" vectors, following Vignaux and Michalewicz (1991). Formally, a new vector v is created from two vectors \(v_1\) and \(v_2\) according to the formula \(v = \lambda v_1 + (1 - \lambda) v_2\) (we set \(\lambda = 0.5\)). There is a probability of 20 % that the offspring is a duplicate of one of its parents. The reason for allowing duplicates in some cases is to balance exploration and exploitation: a very high crossover rate relative to selective pressure, given high initial variability and a very large population, may turn evolution into a random search (Goldberg 1989).
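A sketch of this crossover operator, with the 80 % blending probability and \(\lambda = 0.5\) from above:

```python
# Sketch of the crossover operator: with 80 % probability the child is the
# weighted average of its parents (lambda = 0.5); otherwise (20 %) it is a
# duplicate of one parent, chosen at random.
import random

def crossover(parent1, parent2, rng, lam=0.5, p_blend=0.8):
    if rng.random() < p_blend:
        return [lam * w1 + (1.0 - lam) * w2 for w1, w2 in zip(parent1, parent2)]
    return list(rng.choice((parent1, parent2)))

rng = random.Random(42)
print(crossover([0.6, 0.4], [0.2, 0.8], rng))  # e.g. [0.4, 0.6] when blended
```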

Mutation in GAs serves both to preserve existing diversity and to introduce new variation. It is aimed at preventing the GA from falling into a local extremum, but it should not be applied too often, due to the danger of transforming the GA into a random search. The mutation operator introduced here includes a probability (3 %) that an arbitrary weight in a vector is changed by a uniformly randomized factor in the range of [−0.3, 0.3] around its original value.
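One plausible reading of this operator, sketched below, scales a single weight by a factor drawn uniformly from [−0.3, 0.3] around its original value; the exact interpretation of "around its original value" is an assumption.

```python
# Sketch of the mutation operator: with 3 % probability, one arbitrary weight
# is changed by up to +/-30 % of its original value.
def mutate(vector, rng, p_mutation=0.03, spread=0.3):
    child = list(vector)
    if rng.random() < p_mutation:
        i = rng.randrange(len(child))
        child[i] += child[i] * rng.uniform(-spread, spread)
    return child
```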

Termination:

The generational process is repeated until a termination condition has been reached: a plateau of solution/combination fitness such that successive iterations no longer produce significantly better results. In our implementation, a single iteration showing no significant improvement in the best individual fitness triggers termination. The minimal improvement in our experiments was set to \(\epsilon = 1.0 \times 10^{-21}\).
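Putting the pieces together, the following sketch of the generational loop reuses the init_population, select, crossover, and mutate functions from the previous sketches; the fitness argument is a stand-in for the ROUGE evaluation.

```python
# Sketch of the generational loop with plateau-based termination: evolution
# stops as soon as one generation fails to improve the best fitness by more
# than epsilon (1.0e-21 above).
import random

def run_ga(fitness, epsilon=1.0e-21, max_generations=1000):
    rng = random.Random(0)
    population = init_population()
    best_fit = max(fitness(v) for v in population)
    for _ in range(max_generations):
        survivors, elite = select(population, fitness)
        children = [elite]  # elitism: keep the best solution
        while len(children) < len(population):
            p1, p2 = rng.sample(survivors, 2)
            children.append(mutate(crossover(p1, p2, rng), rng))
        population = children
        new_best = max(fitness(v) for v in population)
        if new_best - best_fit <= epsilon:  # plateau: no significant gain
            break
        best_fit = new_best
    return max(population, key=fitness)
```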

3.4 Training scenarios

The training of MUSE can be performed according to monolingual and/or cross-lingual scenarios, depending on which of the following situations holds:

1. A training corpus in the target language is available. Since MUSE is language-independent, it can be trained on a corpus of summarized documents in any target language. This approach is called "monolingual training".

2. A training corpus in the target language is not available, but a training corpus in a different ("source") language exists. Here, several options can be considered:

   (a) One may train MUSE on the existing corpus and use the same trained model across different languages. Figure 4 depicts the flowchart of this approach, called "cross-lingual training". The approach is somewhat problematic since, despite the language independence of MUSE, different languages may require different trained models. The next scenario is aimed at resolving the tradeoff between expensive manual annotation and multilingual summarization performance.

   (b) To obtain language-oriented trained models when data in the target language is lacking, one may translate a corpus from the source to the target language using machine translation tools and use the resulting parallel corpora for training. We propose a methodology for the cross-lingual training of a summarization system that is based on the early translation strategy, where each document in the training corpus is translated into the target language prior to model learning. The flowchart of this scenario is depicted in Fig. 5.

Fig. 4 Cross-lingual training using source-language corpora

Fig. 5 Cross-lingual training using parallel corpora

Note that MUSE and ROUGE can be replaced in the figures above with any corpus-based summarizer and evaluation tool, respectively. Both scenarios for cross-lingual learning are generally applicable to any language-independent summarizer or language-dependent summarizer adapted to the source/target language.

3.5 Complexity issues

Assuming an efficient implementation, most sentence ranking methods used by MUSE have a computational complexity that is linear in the total number of words in a document, O(n). As a result, given a trained model, MUSE document summarization time is also linear in the number of features in the combination. The training time is proportional to the number of GA iterations, multiplied by the number of individuals in the population and by the fitness evaluation (ROUGE) time. On average, in our experiments the GA performed only 5–6 iterations of selection and reproduction before reaching convergence.

4 Experiments

4.1 Overview

The MUSE summarization approach and the quality of its cross-lingual training were evaluated in comparative experiments on three monolingual corpora of English, Hebrew, and Arabic texts. These languages were chosen intentionally because they belong to distinct language families (English is Indo-European, whereas Hebrew and Arabic are Semitic), so that the results of our evaluation would be widely generalizable. The specific goals of the experiment were to:

1. Evaluate the optimal sentence scoring models induced from the corpora of summarized documents in three different languages,

2. Determine whether the same sentence scoring model can be used effectively for extractive summarization across the three languages,

3. Determine whether using parallel corpora in cross-lingual training improves the multilingual performance of MUSE,

4. Compare the performance of the GA-based summarization method to state-of-the-art approaches, and

5. Compare the GA performance to alternative optimization techniques, viz. Multiple Linear Regression (MLR).

The following subsections describe: our experimental setup (data, evaluation metrics and scenarios), experimental results, and their discussion.

4.2 Experimental setup

4.2.1 Corpora

The English text material used in the experiments comprised the corpus of summarized documents available for the summarization task at the Document Understanding Conference 2002 (DUC 2002). This benchmark dataset contains 533 news articles, each accompanied by two to three human-generated abstracts of approximately 100 words each.

For the Arabic language, we generated, in collaboration with several experts in Arabic, a new corpus compiled from 90 news articles. Each article was summarized by three native Arabic speakers, each of whom selected the most important sentences into an extractive summary of approximately 100 words. All assessors were provided with the Tool Assisting Human Assessors (TAHA) software tool, which enables sentences to be easily selected and stored for later inclusion in the document extract. The agreement between assessors, measured by the ROUGE-1 (Lin and Hovy 2003) score, shows that their summaries overlap by 75 % on average.

For the Hebrew language, we used the corpus generated as part of our experiment, in which 120 news articles of 250–830 words each, taken from the websites of the Haaretz newspaper, The Marker newspaper, and manually translated articles from WikiNews, were summarized by human assessors using the TAHA software. In total, 126 undergraduate students from the Department of Information Systems Engineering, Ben-Gurion University of the Negev participated in the experiment. Each participant was randomly assigned ten different documents and instructed to choose the most important sentences in each document subject to the following constraints: (1) spend at least five minutes on each document, (2) ignore dialogs and quotations, (3) read the whole document before beginning sentence extraction, (4) ignore redundant, repetitive, and overly detailed information, and (5) remain within the minimal and maximal summary length limits (95 and 100 words, respectively). Summaries were assessed for quality by comparing each student's extract to those of all the other students using the ROUGE evaluation toolkit and the ROUGE-1 metric. We discarded the summaries produced by assessors who received an average ROUGE score below 0.5, i.e., who agreed with the rest of the assessors in less than 50 % of cases. The time spent by an assessor on each document was also checked against the requirements. The final corpus of summarized Hebrew texts was compiled from the summaries of about 60 % of the assessors, with an average of five extracts per document. The average ROUGE score of the selected assessors is 54 %. The dataset is available at http://www.cs.bgu.ac.il/~litvakm/research/.

The three corpora differ in the characteristics of their gold standard summaries with respect to the following parameters:

  • Type of summary: the English corpus contains abstracts, whereas the Hebrew and Arabic corpora both contain extracts (extracted sentences);

  • Number of summaries per document: the English corpus contains two to three summaries per document, the Arabic corpus has exactly three extracts, and the Hebrew corpus has five extracts per document on average;

  • Diversity of summaries: the Hebrew corpus contains the most diverse summaries (each assessor summarized only ten documents out of 120), while the Arabic corpus has the most consistent summaries, since the same three assessors summarized all corpus documents;

  • Coverage of summaries: in the Arabic and Hebrew corpora many extracts are compiled from the initial sentences, while the English abstracts contain information representing all sentences in the source document.

The documents in all corpora have a title as the first sentence. The parallel corpora were obtained by machine translating each of the three monolingual corpora from its source language into the two target languages using the Google Translate API, followed by sentence segmentation.

4.2.2 Evaluation metrics

We evaluated the English, Hebrew, and Arabic summaries using the ROUGE-1 and ROUGE-2 metrics described in Lin (2004b). Similar to Lin's conclusion in Lin (2004b), our results for the different ROUGE metrics were not statistically distinguishable. However, ROUGE-1 showed the largest variation across the methods, and, according to the conclusion made in Lin (2004a), ROUGE-2 is a good choice for single-document summarization tasks. In the following comparisons, all results are presented in terms of the ROUGE-1 and ROUGE-2 Recall metrics. In order to use the ROUGE toolkit on Hebrew and Arabic, it was adapted to these languages by specifying regular expressions that match a single "word" composed of Hebrew or Arabic characters.
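The exact expressions are not given here; a plausible adaptation, sketched below, uses the Hebrew and Arabic Unicode blocks as word character classes.

```python
# A plausible adaptation of the word-matching step for Hebrew and Arabic: the
# character classes below (the Hebrew and Arabic Unicode blocks) are an
# illustrative assumption, not the exact patterns used with the ROUGE toolkit.
import re

HEBREW_WORD = re.compile(r"[\u0590-\u05FF]+")
ARABIC_WORD = re.compile(r"[\u0600-\u06FF]+")

def tokenize(text, pattern):
    return pattern.findall(text)

print(tokenize("שלום עולם", HEBREW_WORD))    # ['שלום', 'עולם']
print(tokenize("مرحبا بالعالم", ARABIC_WORD))  # ['مرحبا', 'بالعالم']
```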

4.2.3 Evaluation scenarios

According to the goals of our experiment listed in Sect. 4.1 above, we performed the following evaluations:

1. We evaluated the monolingual training of MUSE on each monolingual corpus using 10-fold cross validation.

2. We compared the MUSE approach with the following unsupervised summarization methods:

   (a) a multilingual version of TextRank (Mihalcea 2005), as the best-known multilingual summarizer (denoted as ML_TR in Table 1),

   (b) degree-based Coverage (denoted as COV_DEG in Table 1), as the best single scoring method on the English corpus, and

   (c) the Baseline approach, which compiles summaries from the initial sentences (denoted as POS_F in Table 1). The Baseline was found to be the best single feature on the Arabic and Hebrew corpora.

3. As part of the monolingual experiment, we compared the performance of two different optimization techniques used to calculate the optimal linear combination of sentence features. Since a Genetic Algorithm is known to be a time- and space-consuming technique, it was compared to a common and simple optimization method, Multiple Linear Regression. For this experiment, the corpus of English summarized documents (DUC 2002) was utilized. We calculated the 31 features (see Table 1), as well as a ROUGE score, for each sentence of the corpus documents, where the ROUGE value represented the sentence relevance score for inclusion in the document summary. The Least Squares Algorithm was then run in order to estimate a multiple linear regression model, a linear combination of the 31 features, with the ROUGE score as the dependent variable predicting the sentence relevance score in future summarization, and the 31 sentence features as independent predictor variables (a minimal sketch of this baseline appears after this list).

4. We evaluated the quality of cross-lingual training with MUSE by applying the model trained on a corpus in one (source) language to documents in another (target) language.

5. The last phase of our experiment was to determine whether using parallel corpora in cross-lingual training improves the multilingual performance of MUSE. All available data (the translated and target corpora, respectively) was used for training and testing the summarizer, and 10-fold cross validation was applied.

Three research hypotheses were tested by performing three different statistical tests, formulated in Table 3. The research ("alternative") hypotheses are shown in the right column of Table 3. To perform the testing, the results were analyzed and compared against the null hypotheses (shown in the left column of Table 3) using a paired t test or the Wilcoxon matched-pairs signed-ranks test, depending on whether the data passed the Kolmogorov-Smirnov normality test.

Table 3 Performed tests: alternative and null hypotheses

4.3 Experimental results

According to the evaluation scenarios listed above, we obtained the following results:

1. The results of the monolingual training and testing of MUSE on the English, Hebrew, and Arabic corpora are shown in Tables 4 and 5 for ROUGE-1 and ROUGE-2, respectively. The average ROUGE values obtained using 10-fold cross validation are reported.

2. Tables 4 and 5 also show the comparative results for MUSE and the unsupervised methods on each corpus, for ROUGE-1 and ROUGE-2, respectively. From these tables it can be concluded that MUSE performs significantly better (see the statistical analysis of Test 1 below) than the other (unsupervised) summarizers on all three corpora, except for the baseline in Arabic, which was indistinguishable from MUSE based on the ROUGE-2 score (see the explanation of this phenomenon below). According to the p values of Test 1, the null hypothesis ("MUSE does not outperform other approaches.") can be rejected at the 0.01 significance level.

3. The results of cross-lingual training are presented in Tables 6a and 7a, from which it can be seen that the null hypothesis of Test 2 ("Training MUSE on source-language corpora does not decrease the summarization quality.") can be rejected only for the Hebrew and Arabic summarizers, in most cases. According to the p values of Test 2, it can be rejected at the 0.01 significance level. An exception to this conclusion is the Arabic summarizer using the Hebrew model, where the decrease in both ROUGE scores was not significant. Surprisingly, the English summarizer performs significantly better using foreign models than using models trained on the English corpus. Possible reasons for this outcome include the larger number of annotators per document in the Hebrew and Arabic corpora and the gold standard summaries in English being abstracts rather than extracts. Even when trained on a foreign language, MUSE outperforms TextRank in most cases, as can be seen from Tables 4, 5, 6 and 7. Training MUSE on two source corpora instead of one improved the results of training on a single corpus for the Hebrew summarizer only.

4. Tables 6b and 7b present the results of applying a summarization model trained on documents translated into a target language to original documents in that target language. For example, applying the model trained on the English corpus (DUC 2002) translated into Hebrew to the original Hebrew corpus resulted in a 0.518 ROUGE-1 Recall score (Table 6b, first row, second column). The quality of summarization after training on original versus translated data is quite close, though statistically distinguishable in most cases (see the statistical analysis of Test 3 below). The results in Tables 6b and 7b demonstrate a significant improvement in summarization quality when the following translated corpora are used for summarizing documents in English, Hebrew, and Arabic: Arabic to English, Arabic to Hebrew, English to Hebrew, and English to Arabic. In all other cases the null hypothesis cannot be rejected, as no significant improvement was observed. Translating Hebrew to Arabic even decreased the summarization quality in terms of ROUGE-2 scores. Based on these experimental results, it seems that translation may help to obtain more accurate models and improve the cross-lingual learning of MUSE, given a high-quality machine translation tool for a source-target pair of languages. We suppose that, since Hebrew is a resource-poor language, machine translation from Hebrew suffers from low quality. Training MUSE on two source corpora instead of one did not improve the results of training on a single corpus.

5. Applying the estimated MLR model to predict the sentence relevance score for summarizing the same set of documents resulted in a 0.426 ROUGE-1 score, which equals the Baseline score and is significantly lower than the MUSE score. A pairwise comparison of the weights in the two models shows no correlation between the two weighting vectors (Pearson correlation = −0.172). A possible reason for the difference between the GA and MLR models is that, in our experiments, GA and MLR used slightly different objective functions. The GA fitness function was the ROUGE score of complete document summaries generated by a candidate solution (a global objective), whereas MLR used the ROUGE scores of single sentences as its objective function (in order to obtain a sentence-ranking model) and compiled the final summaries from the top-ranking sentences. Since the greedy approach does not necessarily solve global optimization problems (such as the knapsack problem), MLR performed worse as a global optimizer. The optimization procedures also differ: the GA simultaneously explores a diverse population of candidate solutions and strikes a balance between exploration and exploitation, whereas MLR minimizes the sum of squared residuals. Apparently, the GA approach has an advantage in both respects.

Table 4 Monolingual training (ROUGE-1)
Table 5 Monolingual training (ROUGE-2)
Table 6 MUSE cross-lingual training (ROUGE-1)
Table 7 MUSE cross-lingual training (ROUGE-2)

Tables 4, 5, 6 and 7 also show the results of the statistical tests, with significantly different scores marked by stars (*p value of 0.05, **p value of 0.01, and ***p value of 0.001). Tables 4 and 5 contain the results for Test 1; Tables 6a and 7a mark the ROUGE scores obtained by cross-lingual training that are significantly lower than the scores obtained by monolingual training (Test 2); and Tables 6b and 7b present the comparison between the ROUGE scores obtained by cross-lingual training with and without translation (Test 3).

It can be seen that the ROUGE scores obtained differ considerably between the three languages: the lowest values were obtained for the English summaries, and the highest for the Arabic corpus. This phenomenon can be explained by the different characteristics of the gold standard in each corpus. For example, the DUC 2002 corpus (English) contains 2–3 abstracts of approximately 100 words for each document, the Hebrew corpus consists of five 100-word extracts per document on average, and the Arabic corpus contains exactly three 100-word extracts per document. Since all the evaluated summarization methods generate extracts, their match with the human-generated extracts in Arabic and Hebrew was higher than with the English abstracts. Another limitation of the gold standard extracts in the Arabic and Hebrew corpora is that many summaries are compiled from the initial sentences, which explains the strong performance in both corpora of the single unsupervised method ("baseline") that takes the initial sentences as a summary.

Figure 6 presents the models learned by MUSE on the different monolingual corpora using ROUGE-1 and ROUGE-2, respectively. It is noteworthy that while the optimal values of the weights in the linear combination were expected to be nonnegative, the trained models actually included some negative values. Although there is no simple explanation for this outcome, it may be related to a well-known phenomenon from Numerical Analysis called over-relaxation (Friedman and Kandel 1994). For example, the Laplace equation \(\phi_{xx} + \phi_{yy} = 0\) is iteratively solved over a grid of points as follows: at each grid point, let \(\phi^{(n)}, \overline\phi^{(n)}\) denote the nth iteration as calculated from the differential equation and its modified final value, respectively. The final value is chosen as \(\omega\phi^{(n)} + (1 - \omega)\overline{\phi}^{(n-1)}\). While the sum of the two weights is obviously 1, the optimal value of \(\omega\), which minimizes the number of iterations needed for convergence, usually satisfies \(1 < \omega < 2\) (i.e., the second weight \(1 - \omega\) is negative) and approaches 2 as the grid becomes finer. Though somewhat unexpected, this result can be rigorously proved (Varga 1962). In the context of the summarization problem, over-relaxation means attaching higher positive weights, i.e., "awards", to "better" features and negative weights, i.e., "penalties", to "worse" features. As can be seen from the charts, some features behave similarly across languages (for example, the position and coverage features), and some features always receive high positive (POS_F and POS_B) or high negative (POS_L) weights. Some features are correlated (Litvak 2010; Litvak et al. 2010a). However, this should not affect the performance of our method, which chooses the optimal weights of all features simultaneously.

Fig. 6 Models trained on monolingual corpora: ROUGE-1 (left) and ROUGE-2 (right)

We performed additional experiments for a deeper analysis of the GA behavior on our text summarization problem. First, we checked whether the GA terminated with the same solutions over multiple cross validation runs, by calculating the cosine similarity between these solutions and their centroid (average) vector. According to our experimental results on the Hebrew corpus, the solutions over multiple cross validation runs are very close to each other, with an average cosine similarity of 0.75. Second, in order to determine whether the GA was "stuck" in a local optimum, we calculated the distribution of the last generation of vectors. According to the experimental results on the Hebrew corpus, the last generation of vectors appears to be relatively diverse, since only 20 % of the final population has a cosine similarity of more than 0.5 to the centroid vector. Since high genotypic diversity is supposed to prevent premature convergence to a local optimum (Burke et al. 2004), this leads us to the conclusion that the best fitness in our final population may actually be close to the global optimum.

Figures 7, 8, and 9 show sample documents and the summaries generated by MUSE, in the source language and translated into English, for the Arabic, Hebrew, and English languages, respectively. The summaries' length was restricted to 100 words. It can be seen that the summaries contain the most informative sentences of the original documents while avoiding small details.

Fig. 7 Arabic document titled "America: an unprecedented step in the International (Monetary Fund)" and its summary. a Source document, b translated document, c original summary, d translated summary

Fig. 8 Hebrew document titled "Netanyahu and Abbas agreed to complete negotiations within a year" and its summary. a Source document, b translated document, c original summary, d translated summary

Fig. 9 English document titled "Images reveal Indonesian tsunami destruction" and its summary. a Source document, b summary

5 Conclusions and future work

In this article, monolingual and cross-lingual methods for training MUSE, a supervised approach to multilingual summarization, were described and evaluated on three different languages: English, Hebrew, and Arabic. The evaluation included three different scenarios: (1) retraining for each new language on a new corpus of documents in the target language, (2) using the same trained model across different languages, and (3) using parallel corpora (based on machine translation) for retraining MUSE on each new language.

The experimental results show that MUSE significantly outperforms TextRank, the best-known language-independent approach, in all three languages and in all scenarios, using either monolingual or parallel corpora. The results also suggest that the same weighting model is applicable across multiple languages: despite a statistically distinguishable decrease in summarization quality compared to monolingual training, this approach still preserves a reasonable level of quality while saving the annotation effort for each target language. Moreover, using translated corpora may improve the cross-lingual performance of MUSE relative to training on source-language corpora, while requiring only a minor effort from the end user to prepare a machine-translated version of an existing corpus.

During our research we tried to analyze the reasons for MUSE's superiority by experimenting with different settings:

1. Replacing the GA with another (MLR) learning procedure (see Sect. 4.3 above),

2. Reducing the number of features (Litvak et al. 2010a),

3. Restricting all feature weights to non-negative values only (Litvak 2010).

Based on these results, we conclude that MUSE achieves its superior performance thanks to a large set of features relevant to the summarization task, combined with the GA, which is a good choice for optimizing a linear combination of those features. Allowing both positive and negative weights in the linear combination improved the results as well.

Generally, we can conclude that a combination of as many independent statistical features as possible can compensate for the lack of linguistic analysis and knowledge when selecting the most informative sentences for a summary. One can add more sentence features and/or use another, more sophisticated supervised model for learning and optimizing a feature combination. We believe that such an approach would also work in the general case of retraining on different genres and languages.

Based on the evaluation results, the following recommendations can be made. If a corpus in the target language exists, the best approach is to train MUSE on the target-language corpus, periodically updating the trained model as new annotated data becomes available. If there is a corpus in a source language but no high-quality target-language corpus, the recommendation is to create a machine-translated corpus for the target language and apply cross-lingual learning of MUSE using this parallel corpus. Using an unsupervised method, which requires no training in any language, is not recommended, since none of these methods was found to outperform MUSE on any of the three languages.

In future work, we suggest evaluating MUSE on additional languages, language families, and genres; incorporating the threshold values of the threshold-based methods (Table 2) into the GA-based optimization procedure; improving the performance of similarity-based methods in the multilingual domain; applying additional optimization techniques, such as Evolution Strategy (Beyer and Schwefel 2002), which is known to perform well in a real-valued search space; reducing the search for the best summary to a multi-objective optimization problem that combines several summary quality metrics; extending the Arabic and Hebrew corpora to improve the quality of the trained summarization models; and adapting the MUSE approach to multi-document summarization.