1 Introduction

In recent years, the explosive growth of online textual data has necessitated the evolution of document summarization systems, which aim to produce a shorter version of an original document while preserving its main information. Since summarization facilitates many widespread downstream applications, such as generating news digests and headlines and automatically writing reports, substantial effort has been invested in this task [126].

Document summarization methods can be mainly divided into two categories: abstractive [27, 30, 32] and extractive [7, 9, 29]. In particular, abstractive approaches generate concise summaries through paraphrasing and word replacement, while extractive approaches form summaries by identifying and concatenating salient text spans (e.g., sentences) from documents. Extractive methods are usually simpler and more computationally efficient than abstractive ones and, at the same time, guarantee the syntactic and semantic correctness of the generated summaries [4]. Hence, we focus on extractive summarization in this paper. Moreover, there are two main types of extractive models: auto-regressive and non auto-regressive. Compared with non auto-regressive models [9, 29], auto-regressive extractive summarization [7, 28, 40] is believed to be a more reasonable strategy, as it predicts the extraction label of the current sentence while taking into account the labels of previously extracted sentences, i.e., the partial summary. Existing methods [7, 28] construct the partial summary representation through a weighted aggregation of previous sentence representations, where the weights are given by their extraction probabilities (Figure 1 illustrates an example).

Figure 1: Illustration of the disadvantages in previous auto-regressive extraction models, e.g., SummaRuNNer [28]

1.1 Challenges and contributions

An obvious discrepancy in existing extractive summarization models is that sentence extraction is a strict Yes-or-No decision; there is no such thing as a partial extraction. This discrepancy, referred to as the partial extraction discrepancy hereinafter, yields a noisy representation of the summary, degrading the quality of decisions on selecting subsequent sentences. For instance, as demonstrated in Figure 1, although sentences 1 and 3 will not be extracted in the final summary, their representations (noise) are still included in the partial summary representation, and consequently the estimated extraction probabilities of the subsequent sentences are affected. The fundamental cause of this problem is that the existing workflow is a ranking-based approach, which must first finish predicting the extraction probabilities of all sentences and only then collect the sentences with the Top-K highest extraction probabilities as the summary. In other words, the model is agnostic of which sentences will be extracted until all the sentences have been processed, making it infeasible to derive an unbiased partial summary representation.

Table 1 Example of lead bias. \(\bar {R}\) is the averaged ROUGE-1/2/L F1 score

Another disadvantage of extractive methods is lead bias [9, 12, 24], referring to the tendency of the output summary to be mostly composed of leading sentences. It stems from the sequential nature of the sequence labelling process: leading sentences are exposed to the model first, and once they are extracted and incorporated into the partial summary representation, later sentences may be considered redundant and rejected, regardless of whether they would be a better choice. Consider the illustrative example provided in Figure 1 and Table 1. Although sentences {3,4,5} compose a better summary (\(\bar {R} = 55.38\)), the extractive model may end up with the suboptimal summary {2,4,5} (\(\bar {R} = 54.66\)). Even though sentence 3 (\(\bar {R} = 44.12\)) is essentially a better choice, sentence 2 is likely to be extracted first due to its informativeness (\(\bar {R} = 42.29\)). Once sentence 2 is extracted, sentence 3 is considered highly duplicated (nearly identical to sentence 2) and gets rejected, leaving the suboptimal summary {2,4,5}.

In this paper, we introduce AES-Rep (Auto-regressive Extractive Summarization with Replacement), a novel auto-regressive extractive model that performs a series of summary update actions to constitute a summary. To address the first problem, unlike the widely-used ranking-based methods, we develop a classification setting that explicitly maintains a partial summary, which is directly updated using two actions: IGNORE the current sentence, or ADD it to the partial summary. If IGNORE/ADD is selected, the representation of the current sentence is completely excluded from/included in the partial summary representation, respectively, preventing the accumulation and propagation of errors from sentences that will not be extracted. However, the requirement of an instant prediction for each sentence in the sequence poses a great challenge for an ordinary classifier. To meet this challenge, instead of using a single loss function, we craft an attentive loss based on the ROUGE distribution to optimize the partial features (i.e., the attentive document representation).

For the second problem, to realize a fair competition and alleviate the disadvantages brought by sentence position, we introduce a third action: REPLACE an extracted sentence with the current sentence, where an external replacement locater module is designed to further determine which sentence in the partial summary will be replaced by the current one, and to update the partial summary accordingly. More specifically, we incorporate an introspective alignment between alternative sentence representations. This not only imbues our model with reasoning capabilities but also enables a fine-grained comparison between aligned representations. Additionally, we investigate the distribution of the relative distance between valid replacements, and, based on our statistical results, exploit distance information as an indispensable clue for the replacement locater module. In this way, the finally selected sentences are decided by the expressiveness of the sentences themselves with respect to the main idea of the document, rather than by over-exploiting the positional advantages of sentences.

Overall, our major contributions in this work are fourfold:

  • For the first time, we investigate the problem of partial extraction discrepancy that exists in auto-regressive extractive methods and identify the fundamental cause of this problem.

  • We propose a new extractive summarization framework that assigns instant, explicit actions to the sequence of sentences in a document and maintains a clean partial summary, facilitating accurate actions on subsequent sentences during extraction.

  • We design a replacement locater module. By leveraging introspective alignments and distance information, our model is able to reselect sentences that are more crucial for expressing the main idea of the document, without being limited by sentence positions.

  • We conduct extensive experiments on widely-used datasets, and the experimental results verify the superiority of our proposed model compared with various strong baselines.

The remainder of the paper is organized as follows: we introduce the details of our proposed model in Section 2 and report the experimental results in Section 3; the current literature on document summarization is discussed in Section 4, followed by a brief conclusion of the work in Section 5.

2 Methodology

In this section, we first provide preliminaries illustrating how we encode sentences with a heterogeneous graph, following prior work. We then describe the overall workflow of the proposed AES-Rep model. Lastly, we elaborate on how the main components, the extraction decision module and the replacement locater module, work in detail.

2.1 Preliminary

Witnessing the success of applying heterogeneous graphs to non auto-regressive summarization, we follow the work of [38] to encode sentences.

There are two types of nodes in the heterogeneous graph, namely word nodes and sentence nodes. Each sentence node is connected with the word nodes contained in that sentence, and each word node is connected with the sentence nodes that contain it. Formally, a document can be represented as a heterogeneous graph G = {V,E}. Here, the node set V is the union of word nodes \(V_{w}=\{w_{1}, w_{2}, \dots , w_{m}\}\) and sentence nodes \(V_{s}=\{s_{1},s_{2},\dots ,s_{n}\}\), where m and n denote the number of unique words and sentences respectively. The edge set E contains all connected word-sentence node pairs \((w_{i}, s_{j})\), representing the connectivity of the heterogeneous graph.

After constructing the heterogeneous graph, we initialize the graph by associating each node with a real-valued vector, which will be progressively updated and refined during the subsequent iterative update phase. The word nodes are initialized with pre-trained word embeddings. For sentences, we use CNN with various filter sizes to capture diverse local n-gram features and then apply LSTM over them to obtain global semantic features. Then, we initialize sentence nodes by applying a linear transformation over the concatenation of local and global sentence features. \({H_{w}^{0}}\) and \({H_{s}^{0}}\) symbolize the initial representations of all word and sentence nodes respectively.

After initialization, the heterogeneous graph updates the node representations by iteratively passing messages between word and sentence nodes. Given the constructed graph G with initial node representations (\({H_{w}^{0}}, {H_{s}^{0}}\)) and edge set E, we apply the Graph Attention Network (GAT) [36] to update semantic node representations. Formally, with \(h_{i}\) signifying the representation of the i-th (word or sentence) node, the GAT works as follows:

$$\begin{aligned} e_{ij} &= \operatorname{LeakyReLU}\left(\vec{a}^{T}[Wh_{i}\|Wh_{j}]\right)\\ \alpha_{ij} &= \frac{exp(e_{ij})}{{\sum}_{j^{\prime} \in \mathcal{N}_{i}}{exp(e_{ij^{\prime}})}}\\ u_{i} &= \sigma\left(\sum\limits_{j \in \mathcal{N}_{i}}{\alpha_{ij}Wh_{j}}\right) \end{aligned}$$
(1)

where ∥ denotes concatenation, σ indicates an activation function, \(\vec{a}, W\) are trainable parameters, \(\alpha_{ij}\) is the attention weight between \(h_{i}\) and \(h_{j}\), and \(\mathcal{N}_{i}\) is the neighbor set of node i, containing all j such that \((V_{i},V_{j}) \in E\). The above vanilla attention is extended to multi-head attention [35], where K independent attention mechanisms are performed and their outputs are concatenated:

$$u_{i} = \|_{k=1}^{K}{\sigma\left(\sum\limits_{j \in \mathcal{N}_{i}}\alpha^{k}_{ij}W^{k}h_{j}\right)}$$
(2)

where the superscript k indicates that the attention weights \(\alpha^{k}_{ij}\) and the transformation matrix \(W^{k}\) come from the k-th attention mechanism.
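
For concreteness, the following is a minimal NumPy sketch of one multi-head GAT layer as defined in (1) and (2). The function names, the choice of tanh as σ, the LeakyReLU slope, and the assumption that both node types share the same dimensionality and projection are our own illustrative simplifications, not details taken from the original implementation.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H_dst, H_src, neighbors, W, a, num_heads):
    """One multi-head GAT layer following (1)-(2).

    H_dst: (n_dst, d) representations of the nodes being updated.
    H_src: (n_src, d) representations of the nodes being attended to.
    neighbors: list of source-node index lists, one per destination node.
    W: (num_heads, d_out, d) per-head transformation matrices.
    a: (num_heads, 2 * d_out) per-head attention vectors.
    Returns: (n_dst, num_heads * d_out) concatenated head outputs.
    """
    outputs = []
    for k in range(num_heads):
        Wh_dst = H_dst @ W[k].T          # (n_dst, d_out)
        Wh_src = H_src @ W[k].T          # (n_src, d_out)
        updated = np.zeros_like(Wh_dst)
        for i, nbrs in enumerate(neighbors):
            # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for each neighbor j
            cat = np.concatenate(
                [np.repeat(Wh_dst[i:i + 1], len(nbrs), axis=0), Wh_src[nbrs]], axis=1)
            e = leaky_relu(cat @ a[k])
            alpha = np.exp(e - e.max())
            alpha /= alpha.sum()                       # softmax over the neighborhood
            updated[i] = np.tanh(alpha @ Wh_src[nbrs]) # sigma instantiated as tanh here
        outputs.append(updated)
    return np.concatenate(outputs, axis=1)
```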

We further introduce residual connections [14] to avoid gradient vanishing and a Position-wise Feed-Forward Network [35] to enhance expressiveness. The word and sentence representations are then updated in an iterative manner: each iteration contains a sentence-to-word update and a word-to-sentence update. The t-th iteration can be denoted as follows:

$$\begin{aligned} U^{t+1}_{w \gets s} &= GAT\left({H^{t}_{w}}, {H^{t}_{s}}, {H^{t}_{s}}\right)\\ H^{t+1}_{w} &= FFN\left(U^{t+1}_{w \gets s} + {H^{t}_{w}}\right)\\ U^{t+1}_{s \gets w} &= GAT\left({H^{t}_{s}}, H^{t+1}_{w}, H^{t+1}_{w}\right)\\ H^{t+1}_{s} &= FFN\left(U^{t+1}_{s \gets w} + {H^{t}_{s}}\right) \end{aligned}$$
(3)

After T iterations, we collect the ultimate (sentence) node representations \({H^{T}_{s}}=\left [h^{T}_{s1}, h^{T}_{s2}, \dots , h^{T}_{sn}\right ]\) as sentence representations. For brevity, hereinafter, we neglect the superscript T and subscript s (sentence node indicator), and reuse the symbol hi to denote the representation of the i-th sentence.
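
Building on the sketch above, the iterative word/sentence update in (3) could be organized roughly as follows; the FFN form, the parameter packing, and the assumption that the GAT output dimension matches the input (so the residual connection type-checks) are simplifications of ours.

```python
import numpy as np  # gat_layer is the sketch defined above

def ffn(X, W1, b1, W2, b2):
    # Position-wise feed-forward network with a ReLU hidden layer.
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2

def encode(H_w, H_s, w2s, s2w, params, T=2):
    """Iterative word/sentence message passing following (3).

    w2s[i]: sentence indices connected to word i; s2w[j]: word indices of sentence j.
    params holds the GAT and FFN weights for each direction (assumed pre-initialized).
    """
    for _ in range(T):
        U_w = gat_layer(H_w, H_s, w2s, *params["s_to_w"])
        H_w = ffn(U_w + H_w, *params["ffn_w"])      # residual connection, then FFN
        U_s = gat_layer(H_s, H_w, s2w, *params["w_to_s"])
        H_s = ffn(U_s + H_s, *params["ffn_s"])
    return H_s   # final sentence representations h_1, ..., h_n
```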

Figure 2: Model Overview. Extracted sentences (1, 2, 4) are shown in italics and the current sentence (5) is shown in bold. Sentence representations are generated by the Sentence Encoder, and other features are subsequently constructed. The extraction logits and the replacement logits are jointly normalized to produce a probability distribution over all the possible actions

2.2 Overall workflow

Figure 2 illustrates the detailed workflow of AES-Rep. After encoding the document, the Extraction Decision Module estimates the extraction affinity for each sentence based on the sentence representation and other auxiliary features. Further, the Replacement Locater Module estimates the propensity of replacing each extracted sentence with the current sentence. Then, the raw extraction and replacement logits are jointly normalized to produce a distribution over all the actions, guiding the update of the summary. Eventually, the sentences remaining in the summary serve as the output for the document.

Next, we provide formal definitions of the two tasks (i.e., sentence extraction and replacement) and attach a variable table (Table 2) to help readers understand and follow the paper.

Sentence extraction

The Extraction Decision Module determines whether each sentence in the document should be extracted and added to the summary or not. Given the current sentence \(s_{i}\), the extraction decision module returns a 2-dimensional vector \(e \in \mathbf{R}^{2}\) indicating the confidence scores of ignoring and extracting the sentence, respectively. The confidence score would be higher if the corresponding action leads to a higher ROUGE score after updating the summary. More details will be introduced in Section 2.3.

Sentence replacement

The Replacement Locater Module determines the propensity of replacing each extracted sentence with the current sentence. Given the current sentence \(s_{i}\) and the summary list containing indices of extracted sentences \(S=\{c_{1}, c_{2}, \dots , c_{k}\}\), the replacement locater module returns a k-dimensional vector \(r \in \mathbf{R}^{k}\), where \(r_{j}\) has a greater value if the resulting summary after replacing \(s_{c_{j}}\) with \(s_{i}\) (\(\{s_{c_{1}}, \dots , s_{c_{j-1}}, s_{c_{j+1}}, \dots , s_{c_{k}},s_{i}\}\)) is of higher ROUGE score compared with other possible replacements. We will elaborate on the details in Section 2.4.
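
To make the interplay of the two modules concrete, here is a rough, pseudocode-style Python sketch of the decoding loop described above; the module interfaces and names are hypothetical stand-ins rather than the actual implementation.

```python
import numpy as np

def summarize(sentences, h, extraction_module, replacement_module):
    """Greedy decoding loop over sentences (a sketch, not the exact training setup).

    h[i] is the representation of sentence i; the two modules return raw logits.
    """
    summary = []                                  # indices of extracted sentences
    for i in range(len(sentences)):
        e = extraction_module(h, summary, i)      # 2 logits: IGNORE, ADD
        r = replacement_module(h, summary, i)     # k logits: REPLACE c_1 .. c_k
        z = np.concatenate([e, r])
        probs = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
        action = int(np.argmax(probs))
        if action == 1:                           # ADD
            summary.append(i)
        elif action >= 2:                         # REPLACE the (action-2)-th extracted sentence
            summary[action - 2] = i
        # action == 0 is IGNORE: the summary stays unchanged
    return [sentences[i] for i in sorted(summary)]
```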

Table 2 A variable table that clarifies the dimensions and meanings of the primary variables used in this section

2.3 Extraction decision module

The Extraction Decision Module determines whether each sentence should be extracted and added to the summary. The extraction decision depends not only on the sentence itself but also on the document representation and the partial summary representation, so that informative and non-redundant sentences are extracted.

Conventional average pooling [28] assumes uniform importance across sentences when constructing the document representation, which may not be optimal. The attention mechanism, a technique that differentiates the relevant parts of an input from the rest, has achieved promising results in Machine Translation [3, 25] and Document Classification [42]. Considering that sentences contribute differently to the semantics of the document, we apply an attention mechanism to attribute higher weights to informative sentences when synthesizing the document representation. The attentive document representation d is computed as follows:

$$\begin{aligned} u_{i} &=\tanh \left(W_{att} h_{i}+b_{att}\right) \\ \alpha_{i} &=\frac{\exp \left(u_{i}^{\top} u_{att}\right)}{{\sum}_{i} \exp \left(u_{i}^{\top} u_{att}\right)} \\ d &=\sum\limits_{i} \alpha_{i} h_{i} \end{aligned}$$
(4)

In the above equations, \(\alpha_{i}\) denotes the importance of the i-th sentence, and \(W_{att}\), \(b_{att}\) and \(u_{att}\) are trainable parameters optimized during model training. The attentive document representation is essentially a weighted combination of the sentence representations, using \(\alpha_{i}\) as weights.
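
A minimal NumPy sketch of the attentive pooling in (4); the parameter shapes are illustrative assumptions.

```python
import numpy as np

def attentive_document_representation(h, W_att, b_att, u_att):
    """Attentive pooling over sentence representations, following (4).

    h: (n, d) sentence representations; returns the (d,) document vector and (n,) weights.
    """
    u = np.tanh(h @ W_att.T + b_att)              # (n, d_att)
    scores = u @ u_att                            # (n,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                          # attention weights alpha_i
    d = alpha @ h                                 # weighted combination of sentences
    return d, alpha
```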

In general, our task is to learn to assign an appropriate action to each sentence such that the updated summary achieves the greatest ROUGE score, where a heuristically generated oracle action distribution is provided for supervision. Since the update actions explicitly manipulate the summary, it is feasible to maintain a summary list tracking which sentences have been extracted, and to derive the partial summary representation solely from the extracted sentences, thus avoiding the partial extraction discrepancy.

We obtain the partial summary representation v by summing up the sentence representations of the extracted sentences and normalizing with the tanh function so that the magnitude stays the same across all time-steps.

$$v = \tanh \left(\sum\limits_{l=1}^{k}{h_{c_{l}}}\right)$$
(5)

The final representation \(u_{i}\) of the i-th sentence for the extraction decision is the concatenation of the document representation d, the partial summary representation v and the sentence representation \(h_{i}\). Then \(u_{i}\) is fed into the extraction classification layer for a two-way classification.

$$\begin{aligned} u_{i} &= [d;v;h_{i}]\\ e_{i} &= W_{ext}u_{i} + b_{ext} \end{aligned}$$
(6)

where \(W_{ext} \in \mathbf{R}^{2 \times d_{ext}}\) and \(b_{ext} \in \mathbf{R}^{2}\) are trainable parameters.
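
Putting (5) and (6) together, the extraction decision for the current sentence could be sketched as follows; the function signature and the zero vector used for an empty summary are our own assumptions.

```python
import numpy as np

def extraction_logits(h, summary, i, d, W_ext, b_ext):
    """IGNORE/ADD logits for sentence i, following (5)-(6).

    h: (n, dim) sentence representations, summary: indices of extracted sentences,
    d: attentive document representation from (4).
    """
    if summary:
        v = np.tanh(h[summary].sum(axis=0))       # partial summary representation (5)
    else:
        v = np.zeros_like(h[0])                   # empty summary (an assumption)
    u_i = np.concatenate([d, v, h[i]])            # [d; v; h_i]
    return W_ext @ u_i + b_ext                    # 2 logits: IGNORE, ADD
```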

2.4 Replacement locater module

Given the summary list \(S = \{c_{1}, c_{2}, \dots , c_{k}\}\) containing indices of extracted sentences, the Replacement Locater Module determines the propensity of replacing each extracted sentence with the current sentence.

Formulation

To model the propensity of replacement, we pair each extracted sentence \(s_{c_{j}}\) with the current sentence \(s_{i}\) for a binary classification, where the output \(r_{j}\) has a greater value if, after replacing \(s_{c_{j}}\) with \(s_{i}\), the resulting summary \(\{s_{c_{1}}, \dots , s_{c_{j-1}}, s_{c_{j+1}}, \dots , s_{c_{k}},s_{i}\}\) has a higher ROUGE score than the other possible replacements.

Sentence pair representation

To determine whether the current sentence \(s_{i}\) is a good replacement for the extracted sentence \(s_{c_{j}}\), we construct a resourceful sentence pair representation sp with features that we believe are useful for the replacement classifier to make correct decisions.

First, we need to consider which candidate (\(s_{c_{j}}\) or \(s_{i}\)) is more relevant to the main point of the document, so we introduce the attentive document representation d into the sentence pair representation.

In addition to the document representation, we need to consider the candidates' relation to the remaining sentences in the partial summary (excluding \(s_{c_{j}}\)), that is, which sentence better complements the remaining sentences. Therefore, we construct two summary representations, namely the original summary representation v (identical to the partial summary representation) and the new summary representation \(v^{\prime}\).

$$\begin{aligned} v &= \tanh \left(\sum\limits_{l=1}^{k}{h_{c_{l}}}\right)\\ v^{\prime} &= \tanh \left(h_{i} + \sum\limits_{l=1, l \neq j}^{k}{h_{c_{l}}}\right) \end{aligned}$$
(7)

Moreover, we obtain the introspective alignment to capture the interaction between the two candidate summaries: i) the element-wise product, which amplifies or dampens the matching signals between the two representations; ii) the element-wise difference, which measures the distance between the two representations.

$$\begin{aligned} prod &= v \odot v^{\prime}\\ diff &= v - v^{\prime} \end{aligned}$$
(8)

Additionally, we examine the distribution of the distance between two alternative sentences in the oracle labels, and observe that the probability of replacement decreases as the distance between the two sentences increases, as depicted in Figure 3. Therefore, we introduce the replace distance embedding \(RepDist(i-c_{j})\) to mitigate spurious long-distance replacements.

Figure 3: Replace distance distribution

Finally, we obtain the resourceful sentence pair representation sp by concatenating all these aforementioned features:

$$sp_{(c_{j},i)} = \left[d;\,RepDist(i-c_{j});\,v;\,v^{\prime};\,prod;\,diff\right]$$
(9)

sp is then fed into the final replacement classification layer to obtain the replacement propensity \(r_{j}\).

$$r_{j} = w_{rep} sp + b_{rep}$$
(10)

where \(w_{rep} \in \mathbf{R}^{1 \times d_{rep}}\) and \(b_{rep} \in \mathbf{R}\) are trainable parameters.
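
A compact sketch of the replacement propensity computation in (7)-(10); the distance-embedding lookup table, the clipping of the distance index, and treating \(w_{rep}\) as a flat vector are simplifying assumptions of ours.

```python
import numpy as np

def replacement_logits(h, summary, i, d, dist_embed, w_rep, b_rep):
    """Replacement propensity r_j for every extracted sentence, following (7)-(10).

    h: (n, dim) sentence representations, summary: indices [c_1, ..., c_k],
    d: attentive document representation, dist_embed: (max_dist, d_emb) lookup table.
    """
    v = np.tanh(h[summary].sum(axis=0))                       # original summary (7)
    r = []
    for c_j in summary:
        rest = [c for c in summary if c != c_j]
        v_new = np.tanh(h[i] + h[rest].sum(axis=0)) if rest else np.tanh(h[i])
        prod = v * v_new                                      # element-wise product (8)
        diff = v - v_new                                      # element-wise difference (8)
        dist = min(abs(i - c_j), len(dist_embed) - 1)         # clipped replace distance
        sp = np.concatenate([d, dist_embed[dist], v, v_new, prod, diff])   # (9)
        r.append(w_rep @ sp + b_rep)                          # scalar propensity (10)
    return np.array(r)
```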

2.5 Loss functions

Attentive loss

As the ROUGE score of an individual sentence can be interpreted as a measure of sentence importance, we would like the attention scores to approximately match the sentence ROUGE score distribution. The ground truth ROUGE distribution is computed as:

$$P_{rouge}(i)=\frac{r(s_{i}, ref)}{{\sum}_{j=1}^{n}{r(s_{j}, ref)}}$$
(11)

where ref is the reference summary and r is a ROUGE based scoring function. The attentive loss Latt is the KL-Divergence between the attention scores and the ground truth distribution:

$$L_{att}(\theta) = -\sum\limits_{i=1}^{n}{P_{rouge}(i)\log\left(\frac{\alpha_{i}}{P_{rouge}(i)}\right)}$$
(12)
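
A minimal sketch of the attentive loss in (11)-(12), assuming the attention weights and per-sentence ROUGE scores are already available as arrays; the epsilon smoothing is our own numerical safeguard.

```python
import numpy as np

def attentive_loss(alpha, rouge_scores, eps=1e-12):
    """KL divergence between the ROUGE distribution (11) and the attention weights (12).

    alpha: (n,) attention weights from (4); rouge_scores: (n,) per-sentence r(s_i, ref).
    """
    p_rouge = rouge_scores / rouge_scores.sum()               # ground-truth distribution (11)
    return float(np.sum(p_rouge * np.log((p_rouge + eps) / (alpha + eps))))
```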

Action loss

Given the current sentence \(s_{t}\) and the extracted sentence indices \(S=\{c_{1}, c_{2}, \dots , c_{k}\}\), we concatenate the outputs of the Extraction Decision Module and the Replacement Locater Module to obtain the raw action logits z, and the extraction logits and the replacement logits are jointly normalized to produce a probability distribution over all the possible actions:

$$\begin{aligned} z &= [e_{1}, e_{2}, r_{1}, r_{2}, \dots, r_{k}]\\ \hat{P}^{(t)}_{i} &= \frac{exp(z_{i})}{{\sum}_{j=1}^{k+2}{exp(z_{j})}} \end{aligned}$$
(13)

The action loss at timestep t is the KL-Divergence between the action probabilities \(\hat {P}^{(t)}\) and the ground truth action distribution \(P^{(t)}\) (we discuss how to generate the ground truth distribution P in Section 3.1), and the action loss for the entire document is the averaged loss across all the timesteps:

$$\begin{aligned} L^{(t)}_{act}(\theta) &= -\sum\limits^{k+2}_{i=1}{P_{i}^{(t)}\log \left(\frac{\hat{P}^{(t)}_{i}}{P^{(t)}_{i}}\right)}\\ L_{act}(\theta) &= \frac{1}{n}\sum\limits_{t=1}^{n}{L^{(t)}_{act}(\theta)} \end{aligned}$$
(14)

The final loss of AES-Rep is the weighted combination of the two losses, with a hyperparameter λ controlling the relative contribution of the attentive loss:

$$L(\theta)=L_{act}(\theta) + \lambda L_{att}(\theta)$$
(15)

In this way, our model is able to consider multiple sources of evidence and approach a globally optimal solution.
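
As a rough sketch of how (13)-(15) combine, the following helpers jointly normalize the logits for one timestep and assemble the final objective; the names and the epsilon smoothing are ours.

```python
import numpy as np

def action_loss_step(e, r, p_true, eps=1e-12):
    """Joint normalization of extraction/replacement logits (13) and the KL loss (14)."""
    z = np.concatenate([e, r])                    # [e_1, e_2, r_1, ..., r_k]
    p_hat = np.exp(z - z.max())
    p_hat /= p_hat.sum()
    return float(np.sum(p_true * np.log((p_true + eps) / (p_hat + eps))))

def total_loss(action_losses, att_loss, lam=0.1):
    """Final objective (15): averaged action loss plus weighted attentive loss."""
    return float(np.mean(action_losses)) + lam * att_loss
```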

3 Experiments

We have conducted extensive experiments on the most commonly used datasets to evaluate the performance of our proposed AES-Rep model, and the experimental results are reported in this section.

3.1 Experimental setup

Datasets

We evaluate our model on the widely-used CNN/DailyMail dataset and the separated CNN and DailyMail datasets, which contain news articles and their highlights (used as abstractive reference summaries). Following previous work [6, 32], we adopt the standard train/validation/test split and obtain the tokenized, non-anonymized dataset by preprocessing. Moreover, we also conduct experiments on the WikiHow dataset [18]. WikiHow is a large-scale summarization dataset extracted and constructed from an online knowledge base written by different human authors. The articles cover a wide range of topics and exhibit highly diverse styles. We show the statistics of the datasets in use in Table 3.

Table 3 Statistics of summarization datasets: the size of train, valid, test splits and average length of documents and summaries (in terms of word and sentence) are reported

Evaluation metrics

We employ ROUGE [20] as the evaluation metric to measure how well the model summary resembles the reference summary by counting the number of overlapping lexical units such as n-grams and word sequences. Following common practice, we report ROUGE-1, ROUGE-2, and ROUGE-L F1 results, which balance precision and recall: ROUGE-1 and ROUGE-2 measure informativeness by counting overlapping unigrams and bigrams, and ROUGE-L measures fluency through the longest common subsequence. We use the average of the three ROUGE F1 variants as the scoring function r:

$$\begin{array}{@{}rcl@{}} r(sum, ref) &=& \frac{1}{3}(\operatorname{ROUGE-1}(sum, ref)\\ &&+ \operatorname{ROUGE-2}(sum, ref) + \operatorname{ROUGE-L}(sum, ref)) \end{array}$$
(16)

We also estimate statistical significance by running our model with different random seeds and performing the t-test between our results and the best baseline performance. We compare p-value with 0.05 and 0.01, and highlight “significant” improvement achieved by our model via * or ** respectively in the following tables.

Model settings

We limit the size of the vocabulary to 50,000 and initialize the embeddings with 300-dimensional GloVe [31] word vectors. The filter sizes of the CNN for extracting local features range from 2 to 7 with 50 feature maps each, and the LSTM for capturing global features is a 2-layer bidirectional LSTM with hidden size 128 in each direction. Following [38], we skip stopwords and the 10% of words with the lowest TF-IDF values when constructing word nodes. The dimensions of the word and sentence node representations are set to 300 and 384 respectively. The GAT has 4 attention heads for word nodes and 6 attention heads for sentence nodes. The intermediate hidden size of the FFN layer is set to 1536. The number of iterations T is set to 2. The size of the replacement distance embedding is set to 384 as well. The weighting factor λ in (15) is set to 0.1. The temperature τ in (17) is set to 0.05 and 0.01 on CNN/DailyMail and WikiHow respectively. To regularize the model, we apply Dropout [34] with probability 0.1 to the output of the first LSTM layer, the GAT inputs, the GAT attention weights and the intermediate output of the FFN. The model is trained with the Adam [17] optimizer with batch size 64. For the hyperparameters of Adam, we set the learning rate lr = 0.0005, the two momentum coefficients β1 = 0.9, β2 = 0.999, and 𝜖 = 10^{-8}. Furthermore, we employ gradient norm clipping to rescale the norm to at most 2.0. We train the model for 20 epochs, select the checkpoint based on the averaged ROUGE score, and report the evaluation results on the test set.

Algorithm 1: Greedy approach to generate ground truth action distributions.

Ground truth generation

Since our model is based on a novel setting, there are no readily available annotated labels for the training data. To address this, we utilize a greedy approach to construct oracle labels, based on the intuition that an action that incurs a larger ROUGE gain with respect to the reference should have a higher probability. Algorithm 1 depicts the details of how we generate the ground truth action distributions based on the human-written abstractive summaries.

For the ground truth action distribution at timestep t, given current sentence st and summary \(S=\{s_{c_{1}}, s_{c_{2}}, \dots , s_{c_{k}}\}\) containing k sentences, the \(gains = [g_{1}, g_{2}, \dots , g_{k+2}]\) (lines 6-14) is an array with length k + 2 containing the ROUGE score gains of each action (ignore, add, and replace each extracted sentence with the current sentence). The normalize function (line 28) is defined as follows to produce a valid probability distribution P:

$$P_{i} = \frac{\exp(g_{i} / \tau)}{{\sum}_{j=1}^{k+2}{\exp(g_{j} / \tau)}}, i=1,2,\dots,k+2$$
(17)

where τ is a hyperparameter controlling the smoothness of the distribution. After obtaining the action probabilities, the summary is updated by applying the action with the maximum probability (lines 20-27).
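
Since the body of Algorithm 1 appears only as a figure, the following is our own rough reconstruction of the described procedure; the hypothetical rouge_gain helper stands in for the actual ROUGE computation and is assumed to return the ROUGE score of the summary formed by the given sentence indices.

```python
import numpy as np

def generate_oracle(sentences, reference, rouge_gain, tau=0.05):
    """Greedy construction of ground-truth action distributions, per Algorithm 1 and (17).

    rouge_gain(summary_indices, reference): assumed scoring helper; tau is the temperature.
    """
    summary, distributions = [], []
    for t in range(len(sentences)):
        base = rouge_gain(summary, reference)
        gains = [0.0]                                          # IGNORE: summary unchanged
        gains.append(rouge_gain(summary + [t], reference) - base)          # ADD
        for j in range(len(summary)):                          # REPLACE c_j with t
            candidate = summary[:j] + summary[j + 1:] + [t]
            gains.append(rouge_gain(candidate, reference) - base)
        g = np.array(gains)
        p = np.exp(g / tau)
        p /= p.sum()                                           # normalize, following (17)
        distributions.append(p)
        best = int(np.argmax(p))                               # apply the best action
        if best == 1:
            summary.append(t)
        elif best >= 2:
            summary[best - 2] = t
    return distributions
```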

Sequentially updating the summary with the current best action in a greedy manner may lead to a local optimum, producing a suboptimal summary and thereby degrading the effectiveness of the downstream summarization model. Regarding this concern, we argue both theoretically and empirically that the greedy heuristic has only a minor effect on the downstream summarization task. On one hand, finding a globally optimal extraction oracle is computationally expensive. As an approximation, the greedy approach has been widely adopted by various competitive systems for generating oracle extraction labels. For example, previous studies [28, 29, 40] maintain an extraction oracle by incrementally adding one sentence at a time to maximize its ROUGE, until none of the remaining sentences improves the ROUGE score of the oracle. On the other hand, we include the extraction oracle in Tables 4 and 5 to give readers a sense of the performance upper bound. Notice that the oracle is ahead of both our model and the other competitive baselines by a considerable margin, so the performance drop caused by the greedy heuristic is a minor concern as the model performance is still far from the upper bound.

Baselines

We compare our proposed model AES-Rep with various baselines: LEAD-3 is a commonly-used baseline method that simply selects the first three sentences in the document as the summary. NN-SE [7] and SummaRuNNer [28] are two cross-entropy based auto-regressive extractive methods, where extraction-probability-weighted sentence representations are used to construct the partial summary representation. RNES [40] is yet another auto-regressive extractive model that combines the cross-sentence coherence and the ROUGE score of the extraction as the reward signal to obtain informative and coherent summaries. NEUSUM [48] jointly learns to score and extract sentences in an auto-regressive manner. REFRESH [29] is an extractive model that treats document summarization as a sentence ranking problem and uses reinforcement learning to globally optimize the ROUGE score. LATENT [43] is a latent variable extractive summarization model that leverages human summaries directly with the help of a sentence compression model. SUMO [23] conceptualizes single document summarization as a tree induction problem. HER [24] is a non auto-regressive method imitating how humans extract summaries, which formulates the learning process as a contextual bandit and solves it with policy gradient reinforcement learning. PACSUM [45] is an unsupervised graph-ranking-based summarization system that uses BERT to capture sentence similarity. Pointer+BERT [46] uses a feature-based BERT (without gradient) as the encoder to get token embeddings and employs a Pointer Network [37] as the decoder to pick summary sentences; Pointer+BERT+RL [46] introduces reinforcement learning to further optimize the model. BERT-ext [2] is yet another architecture based on BERT and Pointer Network, but BERT is utilized to obtain sentence representations directly. HSG [38] constructs heterogeneous graphs by introducing semantic nodes of different granularities, thereby enhancing the model’s capability to learn cross-sentence relations. HSG+Tri-Blocking introduces Trigram Blocking [21] to reduce redundancy in the output summary. For the abstractive models, PGN [32] is capable of generating out-of-vocabulary words by directly copying them from the input document. DRM [30] is trained with a combined loss of supervised learning and policy gradient to mitigate exposure bias and generate readable summaries. BottomUp [10] designs a content selector to determine phrases in a source document that should be part of the summary, and then uses this selector as a bottom-up attention step to constrain the model to focus on likely phrases. DCA [5] addresses the challenge of encoding a long document by introducing multiple collaborating agents, each of which is in charge of a subsection of the input text. BERTSumAbs [21] and BERTSumExtAbs [21] are two abstractive models based on BERT, where the former adopts the default abstractive training protocol while the latter pretrains the encoder with extractive objectives before abstractive training.

3.2 Results and analysis

In this section, we first report the overall results of quantitative evaluation using ROUGE metrics, and then perform an ablation study to examine the effectiveness of each module in the proposed model. Lastly, we do a case study to showcase the decision process of AES-Rep with specific examples.

Table 4 Full length ROUGE F1 evaluation(%) on the combined CNN/Daily Mail test set

Overall performance

We present the results of AES-Rep together with other selected baselines on the CNN/DailyMail dataset in Table 4. The table is divided into 5 blocks, which respectively report the results of unsupervised baselines (and oracle), auto-regressive extractive baselines, non auto-regressive extractive baselines, abstractive baselines and our model.

When comparing with the unsupervised baselines, our model performs better by a considerable margin. In particular, LEAD-3 only considers the importance of sentence positions in a document and simply uses the first three sentences as the summary. Our AES-Rep model achieves a large increase of the ROUGE scores (2.78%/2.28%/2.72%) over LEAD-3 by allowing later sentences (which might be more topically important) to be added to the summary.

Compared with auto-regressive baselines in Table 4, we observe that AES-Rep surpasses all these models in terms of all ROUGE metrics. Specifically, our model achieves a substantial improvement of (7.71%/5.20%/7.18%) and (3.61%/3.70%/4.08%) over NN-SE and SummaRuNNer concerning ROUGE-1/2/L respectively. We attribute the success to the fix of partial extraction discrepancy and the introduction of the replacement locater module. Our model also shows better performance than RNES and NEUSUM.

Among the non auto-regressive extractive baselines (the mainstream of summarization), even against HSG+Tri-Blocking, the state-of-the-art non-BERT-based non auto-regressive model, AES-Rep demonstrates superior performance with a significant improvement (p-value < 0.05). Note that the reported results are produced by directly evaluating our model without any post-processing (e.g., trigram blocking). If we compare AES-Rep with plain HSG without post-processing, the performance gap grows even wider.

Finally, AES-Rep outperforms all the selected abstractive baselines as shown in Table 4. It is worth mentioning that our model surpasses a few BERT-based models, where both extractive and abstractive baselines are included. As a backbone, BERT is pre-trained on enormous corpora containing more than 3300 million words. In contrast, our model is exclusively trained on the summarization dataset, which is much more efficient.

Table 5 Full-length ROUGE F1 on the separated CNN and the Daily Mail test set
Table 6 Full-length ROUGE F1 on the WikiHow test set

We also conduct experiments on the separated CNN and DailyMail datasets and report separate results in Table 5. For the baselines, we select those that have conducted experiments on the separated datasets and report their results. Note that we do not elaborately tune hyperparameters per dataset; instead, we reuse the hyperparameters reported in Section 3.1 to examine their versatility. As shown in Table 5, AES-Rep consistently outperforms all the competitive baselines on both datasets, where the improvement on each ROUGE metric is significant with p-value < 0.01.

For the out-of-domain evaluation, we report the experimental results in Table 6. As can be observed from the table, the advantage of AES-Rep over the other baselines gets smaller compared with that on the CNN/DailyMail dataset, mainly due to the switch of domains. AES-Rep fails to obtain a ROUGE-2 score comparable to that of PGN w/ Coverage, since the higher level of abstraction of the dataset gives abstractive methods an advantage over extractive methods. However, AES-Rep still achieves better ROUGE-1 and ROUGE-L scores than the extractive baselines, with statistical significance p-value less than 0.01 and 0.05 respectively.

Table 7 Ablation studies on the combined CNN/Daily Mail test set
Table 8 Ablation studies on the WikiHow test set

Ablation study

We conduct the ablation study by removing each module of the proposed AES-Rep and observing its effect on the model performance. Firstly, to examine the effectiveness of the sentence replacement mechanism, we deactivate the REPLACE operation during the oracle generation and training stages and predict only the IGNORE/ADD action for each sentence (w/o REPLACE). Secondly, we keep using the attentive document representation but exclude the attentive loss from the training objectives (w/o AttLoss). Thirdly, we replace the attentive document representation with the conventional max/average pooled document representation (also ignoring the attentive loss \(L_{att}\)) and name these variants (w/o AttPool (+MaxPool)) and (w/o AttPool (+AvgPool)), respectively. Finally, we exclude the replacement distance embedding from the replacement classification to examine how much the replacement distance feature contributes to correct classification (w/o RepDistEmbedding). The results of the ablation study on the CNN/DailyMail and WikiHow datasets are presented in Tables 7 and 8 respectively.

Given the results, we have the following observations: (1) The replacement operation is indispensable under our setting, and disabling the REPLACE action results in a significant performance degradation on both the CNN/DailyMail and WikiHow datasets. Without REPLACE, under the strategy of assigning the IGNORE/ADD action that maximizes the ROUGE score of the resulting summary at each timestep, the summary is instantly filled up with leading sentences, causing a catastrophic performance drop, especially when good sentences are located at the beginning of the document but better substitutions exist among subsequent sentences. (2) Plain attentive pooling works well on the WikiHow dataset but fails to outperform conventional max pooling and average pooling on the CNN/DailyMail dataset. However, once the sentence-level ROUGE score distribution is introduced to guide the attention weight distribution, attentive pooling consistently surpasses the conventional pooling methods on both datasets. (3) The replacement distance embedding provides evidence for the replacement locater module from a different perspective, and removing it brings a minor performance decline on both datasets.

Parameter sensitivity analysis

We study the robustness of AES-Rep by investigating the performance fluctuations with varied hyperparameters. Specifically, we study the sensitivity of our model to temperature τ in (17) and weighting factor λ in (15). Based on the hyperparameter setup reported in Section 3.1, we conduct standard one-factor-at-a-time analysis by varying the value of one hyperparameter while keeping others at their baseline values, and report the new summarization performance achieved. Similar to Table 4, ROUGE-1/2/L and ROUGE-Mean \(\bar {R}\) are adopted for evaluation.

Figure 4: Parameter sensitivity analysis on CNN/DailyMail dataset

Figure 5: Parameter sensitivity analysis on WikiHow dataset

Impact of τ

The temperature τ is introduced in (17) to control the smoothness of the ground truth action distribution. As can be seen in Figure 4(a), AES-Rep achieves the best performance (outperforming HSG+Tri-Blocking) on CNN/DailyMail for τ ∈ [0.05,0.1], while still achieving performance comparable to the other baselines when τ ranges broadly from 0.01 to 0.30. Meanwhile, Figure 5(a) illustrates that AES-Rep reaches its peak on WikiHow when τ is around [0.010, 0.012]. In general, AES-Rep benefits from a moderate τ (the suitable range is dataset dependent and thus requires some tuning), and decreasing or increasing τ beyond this range hurts performance. We conjecture the reasons are: 1) a tiny τ produces nearly one-hot labels, and models trained with such labels fail to distinguish between better and worse non-optimal actions, as both are labeled as negative; 2) as the ROUGE gains (the raw logits of the softmax) are on the order of 0.0x-0.x, the ground truth distribution produced by a large τ nearly degenerates to a uniform distribution that provides very little or almost no supervision to the model.

Impact of λ

The weighting factor λ is introduced in (15) to control the relative contribution of the attentive loss. We study the impact of λ ∈ {0.001, 0.01, 0.1, 1.0}. Combining Figures 4(b) and 5(b), the attentive loss weight λ is more robust across datasets, showing a similar trend for all evaluation metrics when increasing from 0.001 to 1.0. Specifically, our model consistently benefits from a relatively larger λ, but when λ reaches a certain magnitude (1.0 in our case), the improvement tends to stop. This is quite intuitive: on one hand, with a tiny λ the attention distribution is not sufficiently regularized, so the learning process does not benefit from this auxiliary task; on the other hand, a large λ dominates the supervision signal, hindering the model from effectively selecting appropriate actions for sentences. As can be seen from the results, λ = 0.1 is a good trade-off between attention distribution regularization and effective action classification, where the summarization performance reaches its peak.

Table 9 Example of AES-Rep predictions

Case study

Table 9 shows an example of AES-Rep predictions with additional columns to better illustrate the predicted actions. For conciseness, we only list the first few sentences, with an unimportant sentence 4 skipped and some irrelevant verbose descriptions replaced by ellipses.

The selected document is a news article about a woman who, posing as a social worker, stabbed a new mother, kidnapped her baby as her own child, and was charged with murder and kidnapping. For the three leading sentences: sentence 0 illustrates that the victim lives with her newborn daughter. Sentence 1 describes how the criminal deceived the victim’s boyfriend. Sentence 2 is a supplementary explanation of sentence 1, stating that the criminal never worked for child welfare and that her identity was faked. The three sentences describe the background of the crime from different aspects with little duplication between them. It is worth noting that during training, the model learns to assign to each sentence the optimal action that leads to the greatest ROUGE score. The leading sentences contain the names of the criminal and the victim and describe how the criminal defrauded the victim’s family, which are also covered in the reference summary; therefore all three leading sentences were extracted. For the sentences that received the REPLACE action: sentence 5 mentions the names of the criminal and the victim, and lists the charges against the criminal in detail. Compared with sentence 1, sentence 6 is more verbose; although both mention the fake identity, sentence 6 further mentions the baby’s name and the follow-up after the baby was stolen. Sentence 7 mentions that the victim’s body was found in the closet of the criminal’s home; although the sentence itself does not share many common words with the reference summary, it still provides crucial information and is worth being extracted into the summary. Sentences with the IGNORE action are either totally irrelevant (sentence 4) or less salient than the selected sentences (the other sentences). From this example, we can see that the action selection module distinguishes salient sentences from irrelevant ones and assigns appropriate actions (ADD/REPLACE and IGNORE) to them; meanwhile, the replacement locater module correctly locates the less informative candidate sentence in the current summary list and replaces it with the current sentence. In conclusion, the entire AES-Rep model functions as expected.

4 Related work

In this section, we review the current literature of document summarization in three categories: extractive models, abstractive models, and combined models.

Extractive approaches

Extractive summarization aims to identify salient sentences and concatenate them to compose the summary. It usually treats summarization as a sequence labelling task, where the model eventually assigns a binary label to each sentence, indicating its inclusion/exclusion in the output.

NN-SE [7] uses a cascade of CNN and RNN as the sentence encoder to generate sentence representations and makes extraction decisions on top of these representations. SummaRuNNer [28] employs a similar hierarchical encoder, but its predictions are more interpretable and can be broken down into several abstractive features such as information content, salience, and novelty. REFRESH [29] treats extractive summarization as a sentence ranking problem and proposes a novel training algorithm to globally optimize the ROUGE evaluation through reinforcement learning. BanditSum [9, 12] formulates extractive summarization as a contextual bandit problem and trains the model with policy gradient to maximize the ROUGE score. RNES [40] combines the cross-sentence coherence and the ROUGE score of the extraction as the reward signal to obtain informative and coherent summaries. BERTSUMEXT [22] obtains sentence representations from BERT [8] followed by several stacked inter-sentence Transformer layers [35] and makes decisions on top of that. Self-Supervised [39] and HIBERT [44] propose novel pretraining tasks aiming to capture the global context at the document level; the model is then fine-tuned with the extractive labels. MATCHSUM [47] creates a paradigm shift and formulates extractive summarization as a semantic text matching problem. DISCOBERT [41] is a BERT-based model that avoids introducing redundant or uninformative phrases into the summary by extracting finer-grained sub-sentential discourse units as candidates for extractive selection. HSG [38] utilizes semantic nodes of different granularity levels to enrich the cross-sentence relations, thus improving the performance of extractive summarization.

Abstractive approaches

Unlike extractive summarization, which exclusively copies content from the original document, abstractive models synthesize the summary word by word from scratch and may thus produce novel words and phrases that do not appear in the original document.

AEDRNN [27] uses an RNN with attention mechanism as the base model and optimizes it with several techniques, including adopting structure-aware hierarchical attention and enhancing word vectors with POS/NER tagging information and TF/IDF statistics. CopyNet [13] and PGN [32] enable the model to generate out-of-vocabulary words by directly copying them from the input document. Furthermore, PGN [32] proposes a coverage mechanism that records which words in the document have been attended to and penalizes the model for repeatedly attending to the same words, seeking to alleviate the generation of repeated content in the output summary. Models trained solely with cross-entropy loss suffer from exposure bias; DRM [30] utilizes a mixed objective function of supervised learning and reinforcement learning to ease this issue. DCA [49] distributes the task of encoding a long document to multiple collaborating encoders, each in charge of a subsection, and employs a single decoder for summary generation. ASGARD [16] utilizes structured representations from a knowledge graph and designs a reward based on a multiple-choice cloze test to encourage producing informative and faithful summaries.

Combined approaches

Combined approaches utilize both summarization techniques so that the abstractive model benefits from the information produced by the extractive model. In general, the combined approach first uses extractive methods to identify salient text spans, and then uses abstractive methods to generate the summary conditioning on these salient text spans.

UnifiedSum [15] proposes a model that treats the sentence extraction probability (from the extractor network) as sentence-level attention to re-weight the word-level attention distribution (from the abstractor network), and introduces a novel inconsistency loss to penalize the inconsistency between the two levels of attention. FastAbsRL [6] employs an extractor agent to extract salient sentences from the document; an abstractor agent then rewrites these sentences into concise summary sentences via compression and paraphrasing. Bottom-Up [11] proposes a two-stage bottom-up summarizer: the model first adopts a content selector to identify tokens that should be included in the summary, and the summary is then generated by a modified pointer-generator network whose copy attention distribution is restricted to the summary-worthy tokens recognized in the previous step. BERT-Abs [19] applies pre-trained BERT to rank sentence singletons and pairs and then compresses or fuses top-ranked instances into summary sentences one after another. SENECA [33] deploys an entity-aware content selection module to collect salient sentences, and an abstract generation module then generates summaries utilizing cross-sentence information.

5 Conclusion

In this paper, we study the two intractable disadvantages in the existing auto-regressive extractive summarization models, i.e., the partial extraction discrepancy and the lead bias, which impair the effectiveness of these models in generating informative document summaries. We then fix the partial extraction discrepancy by explicitly predicting the summary update action for each sentence. Furthermore, we introduce an external replacement locater module to alleviate lead bias by enabling extracted sentences to be replaced by better new sentences. The experimental results on the benchmark CNN and DailyMail datasets show the superiority of AES-Rep compared with the current state-of-the-art baselines.