1 Introduction

A large number of news articles are published daily to cover important events. The volume of such content can be overwhelming for news readers, compromising their ability to get an overall, yet concise, picture of what is happening. To address this issue, multi-document summarization approaches have been developed to provide quick access to essential information [4, 9, 10, 12, 13, 19, 22, 26]. The main challenge in summarizing news articles is capturing the different perspectives, viewpoints, and levels of detail reported for the same event. Once these differences are captured, including them in the summary is far from trivial. For example, different news platforms reported the following information about the same event:

[Figure a: example sentences from different news platforms]

The above sentences shed light on different aspects of the dismissal of the FBI director. They give diverse types of details that we call facets, including the time, the news provenance, and the possible reasons for firing the FBI director. Thus, the task of summarization consists of first identifying the main facts and their facets from a set of news articles, and then fusing those facets into a concise description of each event.

This problem falls into the category of abstractive approaches, where sentences of the original text are rephrased to create summaries [1, 3, 5, 6, 8, 9, 10, 12, 13, 21, 22, 26, 27]. The main idea of abstractive summarization is to leverage fine-grained fact extraction, where a fact can be represented as words or as a semantic triple of the form \(\langle \)Arg1; predicate; Arg2\(\rangle \). Most existing approaches rely on similarity between facts to merge information [1, 3, 6, 9, 10, 13, 19, 22, 26]. In the previous example, sentences S1 and S2 contain the same fact \(\langle \)Donald Trump; fired; FBI director\(\rangle \) and can therefore be fused to summarize three facets of information: the source of the news, the time, and the reason. Typically, the similarity between sentences is based on facts belonging to main clauses. Thus, if two sentences contain similar facts in subordinate clauses, their fusion is not easily handled. For example, considering the main clause, sentence S3 talks about firing the FBI director while sentence S5 talks about “the end of Trump’s presidency”. By contrast, considering the subordinate clauses, they both talk about the “Russian investigation and its consequences”. Thus, by fusing S3 and S5 we can obtain different facets of the fact “Russian investigation”.

The above problem is best handled using semantic summarization [12, 19]. The idea is to extract facts, represented as triples, from text documents and model them as a graph. The nodes of the graph can be either words or word types. Each edge connecting two nodes represents their consecutive occurrence in the same sentence of the original text. Summary sentences are then generated using the top-ranked paths in terms of grammaticality and fact coverage. The graph model facilitates the retrieval and fusion of important triples from both main and subordinate clauses. Additionally, facts are naturally connected along the paths with their facets, so finding the best paths automatically leads to finding the most important facets to include in the summary. The main limitations are related to how this approach was applied to news summarization by Li et al. [12]. First, fact fusion merges triples having similar word types, leading in some cases to incorrect results. For example, “Trump” and “Obama” are both of type person, but they are two different entities and therefore sentences containing them should not be fused together. Second, long paths covering many triples are not necessarily the best, since they might concatenate unrelated facts. Third, facts are clustered using predefined themes, which is inflexible for the dynamic nature of news content. In this paper, we tackle the above problems by extending the approach proposed by Li et al. [12] to handle news summarization more effectively. Our main contributions are as follows:

  1. We propose a fact fusion strategy based on entity linking and predicate similarity. We perform entity linking via entity recognition, name normalization, and coreference resolution using Stanford NLP and DBpedia Spotlight, whereas predicate similarity is computed using WordNet::Similarity.

  2. In addition to grammaticality and fact coverage, we employ node degrees to rank paths. This boosts paths having multiple authoritative nodes and therefore helps finding important facts to be included in the summary.

  3. We propose an alternative to the predefined classification of facts by employing dynamic grouping using K-means clustering. To this end, we use word2vec [17] trained on the Google News dataset to generate word vectors, which are then used to cluster similar facts.

  4. We run extensive experiments on the DUC’04 and DUC’07 datasets, showing that our approach outperforms the baseline approaches by a large margin in terms of ROUGE and PYRAMID scores.

2 Related Work

Our work falls into the category of abstractive summarization of news articles. Many previous attempts were based on facts extracted from main clauses [6, 10, 22, 26], in contrast to ours, which aims to enrich summaries with facets obtained from subclauses. Moreover, they did not focus on obtaining new facts by fusing together individual facts; instead, they simply merged or clustered similar facts. Nevertheless, some core components of these approaches are related to our work. The first one is a fact extraction technique, which is done either via Open Information Extraction (OIE), such as OLLIE [25] and ClausIE [7], via Semantic Role Labeling, such as SEMAFOR [11] and SENNA SRL [5], or via constituent/grammatical dependency parsing. The second one is a fact merging or clustering technique, which can be adapted into a triple fusion approach. Vanderwende et al. [26] used dependency parsing to extract simple facts, leveraged triple similarity, entity coreference, and event coreference for clustering, and then generated summary sentences by unfolding the entity fragments and event fragments. Similarly, D’Aciarno et al. [6] and later Amato et al. [1] leveraged dependency parsing for fact extraction and employed similarity measures based on the subject, predicate, and object for fact merging. Pighin et al. [22] developed their own data-driven pattern extractor for their basic semantic units, and were able to merge multiple facts expressed in different ways.

Khan et al. [10] developed an approach that exploits SENNA [5] to extract basic semantic units (BSUs). They used WordNet::Similarity [21] for SRL unit clustering and then selected the representative fact of each cluster via a scoring function obtained by a genetic algorithm. Summary sentences were generated using SimpleNLG [8] with several heuristics. Genest and Lapalme [9] used dependency parsing for fact extraction in the form of INITs (Information Units), another triple-like structure. They then produced summary sentences using text-to-text generation based on SimpleNLG. Khan et al.’s BSUs and Genest and Lapalme’s INITs can be annotated with dative or locative information, which makes their representation richer than the other approaches described above. Similarly, Li [13] used simple BSU parsing to extract simple facts, which are later clustered based on semantic relatedness between concepts, similarity between verbs, and sentence co-occurrence. Sentence generation is a series of subject, predicate, and object unfoldings from the BSUs. Unfolded objects can be in the form of subclauses. However, fine-grained facts from the subclauses are not specifically extracted and merged.

On the other hand, we found fewer attempts that exploit fact extraction from subclauses. Bing et al. [3] used constituent parsing to extract noun phrases (NPs) and verb phrases (VPs), which are clustered separately and then combined after checking whether some NPs and VPs together satisfy constraints such as compatibility and validity. As a result, some NPs can be matched to VPs obtained from different sentences, forming new, potentially more informative summary sentences. Additionally, NPs (VPs) appearing in a constituent subtree up to two levels deep on a path containing only NP (VP) nodes are also taken into account, effectively parsing independent subclauses, but not dependent ones. Moreover, the approach did not address grammaticality when forming new sentences, and the new sentence formation is limited to coreference resolution of entities. The most relevant work to ours is by Li et al. [12], who developed a pattern-based approach to extract abstractive summaries from news articles. In contrast to Bing et al.’s, Li et al.’s approach is orthogonal to the type of subclause (dependent or independent). It also employs a grammaticality score in order to maintain the quality of its summary sentences. In the next section, we explain Li et al.’s work in more detail and how our work builds upon it.

3 Background

Our work is based on the approach developed by Li et al. [12]. The main idea is to identify patterns of triples and build summaries from the sentences having the largest number of patterns. Practically, triples are extracted from news articles, using OLLIE [25], in the form \(\langle \)arg1; predicate; arg2\(\rangle \), for example \(\langle \)Donald Trump; fired; FBI director\(\rangle \). Then, a pattern is generated from each triple by annotating the heads of arguments arg1 and arg2 with their types. The annotation is done using Stanford NER [15] and SEMAFOR [11]. For example, annotating the triple \(\langle \)Donald Trump; fired; FBI director\(\rangle \) generates the pattern \(\langle \)PROTAGONIST; fired; PERSON\(\rangle \). As OLLIE may return clausal arguments, patterns with such arguments do not get their head annotated. For example, \(\langle \)PERSON; killed by; a gunman who is on the loose\(\rangle \) has only one argument annotated, but it is a valid pattern. Further, patterns are clustered into predefined themes specific to the TAC 2010 guided summarization task. Examples of such themes include “what happened”, “reason”, “damages”, and “countermeasures”. Then, for each cluster, a graph is constructed by fusing together all patterns in the cluster based on their POS and lemma, ignoring stop words. Finally, sentences are generated by traversing all possible paths in the graph, and then ranked based on their grammaticality and pattern coverage. The top-ranked path in each cluster is then picked as the representative summary sentence for the cluster.
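To make the triple-to-pattern step concrete, the following is a minimal, hypothetical sketch of how a typed pattern could be derived from an OLLIE-style triple. The Triple container and the toy type lookup are our own illustrative assumptions; they merely stand in for the output of Stanford NER and SEMAFOR, and in the actual approach only the head token of each argument is typed.

```python
# Toy illustration of the triple-to-pattern step; TYPE_OF stands in for the
# argument-head types produced by Stanford NER / SEMAFOR and keys on the full
# argument string for simplicity (the real approach types only the head token).
from collections import namedtuple

Triple = namedtuple("Triple", ["arg1", "predicate", "arg2"])

TYPE_OF = {"Donald Trump": "PROTAGONIST", "FBI director": "PERSON"}

def to_pattern(triple):
    """Replace each argument with its type if one is known; clausal arguments
    with no known head type keep their surface form, as in the example above."""
    annotate = lambda arg: TYPE_OF.get(arg, arg)
    return Triple(annotate(triple.arg1), triple.predicate, annotate(triple.arg2))

print(to_pattern(Triple("Donald Trump", "fired", "FBI director")))
# Triple(arg1='PROTAGONIST', predicate='fired', arg2='PERSON')
```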

We observe that the approach by Li et al. [12] has three main problems. The first one is that it uses a predefined list of “themes” to group patterns, which is specific to some, but not all, kinds of news articles. Therefore, we need an alternative way to cluster patterns when the themes are unknown. Second, it relies on POS and lemma for pattern fusion. This is problematic because it can fuse unrelated patterns, which consequently generates incorrect sentences. For example, consider the two patterns \(\langle \)Trump; told; Comey to stop his investigation about Russia\(\rangle \) and \(\langle \)Trump; fired; Comey because of his investigation about Hillary Clinton\(\rangle \). The graph fusion approach will generate the following result:

Fig. 1. An example of a fusion graph

Because the sentence ranking is based on pattern coverage and grammaticality, the chosen sentence would be “Trump told Comey to stop his investigation about Hillary Clinton” which is not correct w.r.t. the original news articles. The third problem is that pattern fusion relies on object typings returned by Stanford NER and SEMAFOR. Consider if Trump and Obama both appear in the original text. Both are of type PERSON, so during the fusion they will be merged together, which is not something that should happen because it often leads to incorrect summary sentence generation.

4 Improvements and Extensions

Our summarization approach follows the same line as the work of Li et al. [12], which we consider a baseline solution. Figure 2 shows an overview of our approach. Triple extraction and grouping, graph fusion, and ranking modules indicate the steps that are improved from the baseline. On the other hand, entity and verb linking modules indicate our extensions of the baseline.

Fig. 2. Improvements and extensions

4.1 Triple Extraction and Grouping

We start by extracting triples from a set of news articles. Similarly to the baseline approach, we use OLLIE [25] to extract triples of the form \(\langle \)arg1; predicate; arg2\(\rangle \). The summarization process starts by finding groups of similar triples, as it is crucial to identify the sentences that share the same news focus. In other words, we aim at finding triples that address similar facets. Consider these triples:

[Figure b: example triples T1–T3]

We observe that T1 and T3 have the same first argument, the same predicate, and the same head of the second argument. Thus, on the surface, they could be considered the most similar triples. However, T1 and T2 are actually more similar because both address the reason for Comey’s dismissal. The baseline approach tackled this problem by first grouping triples into predefined themes, such as “consequences” and “reasons”, relying on training data. Since this solution does not cover all news article datasets and is not flexible enough for the dynamic nature of news, we propose an unsupervised approach to define triple groups (i.e., themes).

Our approach consists of three main steps. First, we use word2vec [17] trained on the Google News dataset to generate word embeddings for each word in each triple. Second, we enhance the generated word vectors using PCA (Principal Component Analysis), as it was shown in [2] that this weighting improves the effectiveness of textual similarity tasks by \(10\%\) to \(30\%\) and outperforms sophisticated supervised methods. Third, we perform K-means clustering, a well-established technique in machine learning, to create K clusters of similar triples based on the generated vectors. The clustering starts by randomly selecting K triples as centroids and then maps every triple to its most similar centroid. The centroids are then updated and the process repeats until the clusters are stable. The number K of clusters reflects the size of the summary in terms of number of sentences: one representative sentence, which is a set of triples, is selected from each cluster to be part of the final summary.
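A minimal sketch of this grouping step is given below, assuming gensim for the pretrained Google News vectors and scikit-learn for PCA and K-means. The file name, the simple averaging of word vectors per triple, and the reading of the PCA enhancement as removing the first principal component are our assumptions, not the exact implementation.

```python
# Minimal sketch of triple grouping (Sect. 4.1), assuming gensim for the
# pretrained Google News vectors and scikit-learn for PCA and K-means; the
# file name and the averaging/PCA details are our assumptions.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)

def embed_triple(triple):
    """Average the word2vec vectors of all words in <arg1; predicate; arg2>."""
    words = " ".join(triple).split()
    vecs = [w2v[w] for w in words if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cluster_triples(triples, k):
    """Group triples into k themes; k is the target number of summary sentences."""
    X = np.array([embed_triple(t) for t in triples])
    # One reading of the PCA enhancement [2]: remove the projection of each
    # embedding onto the first principal component before clustering.
    pc = PCA(n_components=1).fit(X).components_[0]
    X = X - np.outer(X @ pc, pc)
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```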

4.2 Entity Linking and Predicate Similarity

When finding similar triples, the mentioned entities are important. The heads of triple arguments are typically entities, which can be a person, an organization, a location, or any other well-defined concept. The first issue with entity recognition is that existing tools do not agree on what constitutes an entity and therefore might miss important entities such as “Crimea”. The second issue is that entities are not always mentioned using their full names, but sometimes using abbreviations or only the last names of people, which we call aliases. Traditional Named Entity Recognition (NER) tools are not always able to recognize entities from aliases. The third issue is that NER approaches are not designed to detect entities that appear as coreferences. This is a problem for our work since we need to find similar triples. For example, there is no way to detect that the triples \(\langle \)Donald Trump; fired; FBI director James Comey\(\rangle \) and \(\langle \)He; fired; Comey\(\rangle \) have identical meaning if the entities are not resolved.

To overcome the above problems we follow our approach in [23], where we performed entity linking. We start with entity recognition, where we exploit DBpedia Spotlight [16] to annotate mentions with entities from DBpedia, a large-scale knowledge base extracted from Wikipedia. DBpedia is a graph database in RDF format: it represents Wikipedia categories as resources and uses the rdf:type predicate to state whether a resource is a class or an individual of a class. Using this property, we filter entities by removing all results produced by NER tools that have no rdf:type property in DBpedia. Then, we introduce a name normalization technique that converts all aliases to normalized names to facilitate entity extraction. To begin, we extract entities from the news article using the entity filtering technique described earlier. For entities of type Person, we set first names, middle names, and last names as aliases. For other types, we find possible aliases using DBpedia Spotlight. As a last step, we apply the Stanford Deterministic Coreference Resolution System [24] to map coreferences to their corresponding entities.
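As a concrete example of the recognition-and-filtering step, the following hedged sketch calls the public DBpedia Spotlight REST endpoint and keeps only mentions grounded to a typed DBpedia resource. The endpoint URL, the confidence parameter, and the JSON field names are assumptions based on the public service, not the configuration used in the paper.

```python
# Hedged sketch of entity recognition and filtering via the public DBpedia
# Spotlight REST endpoint; the URL, confidence value, and JSON field names
# are assumptions about the public service, not the paper's configuration.
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def spotlight_entities(text, confidence=0.5):
    """Return a surface-form -> DBpedia URI map for mentions Spotlight grounds."""
    resp = requests.get(SPOTLIGHT_URL,
                        params={"text": text, "confidence": confidence},
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])
    # Drop mentions whose resource carries no type information, mirroring the
    # rdf:type-based filtering described above.
    return {r["@surfaceForm"]: r["@URI"] for r in resources if r.get("@types")}
```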

Another problem related to triple similarity concerns predicates. Some predicates, which are typically verbs, have the same meaning. For example, the two triples \(\langle \)Donald Trump; fired; FBI director James Comey\(\rangle \) and \(\langle \)Donald Trump; dismissed; FBI director James Comey\(\rangle \) are basically the same. However, this cannot be detected if the two predicates are treated as two different words. To solve this problem, we use WordNet::Similarity [21] to detect similar predicates and use only one representative word for them. Since WordNet returns a similarity score for each pair of predicates, we set the similarity threshold high, concretely at \(90\%\), so we only fuse verbs (predicates) that have very close meanings.
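The sketch below illustrates the predicate check using NLTK’s WordNet interface with Wu-Palmer similarity as a stand-in for the Perl WordNet::Similarity package [21] used in the paper; the choice of similarity measure is therefore an assumption, while the 0.9 threshold mirrors the 90% setting above.

```python
# Sketch of the predicate-similarity check, substituting NLTK's WordNet
# interface (Wu-Palmer similarity) for the Perl WordNet::Similarity package
# [21] used in the paper; the 0.9 threshold mirrors the 90% setting above.
from itertools import product
from nltk.corpus import wordnet as wn

def predicates_similar(verb1, verb2, threshold=0.9):
    """True if the two predicates share a WordNet sense pair above the threshold."""
    synsets1 = wn.synsets(verb1, pos=wn.VERB)
    synsets2 = wn.synsets(verb2, pos=wn.VERB)
    scores = [s1.wup_similarity(s2) or 0.0
              for s1, s2 in product(synsets1, synsets2)]
    return max(scores, default=0.0) >= threshold

# e.g. predicates_similar("fire", "dismiss") is expected to be True, so the two
# triples above can share a single representative predicate.
```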

4.3 Fusion Graph and Strict Merging

As a first step, we follow the baseline approach to build a fusion graph for each group of similar triples or patterns. At this stage we use patterns, since we strictly follow the baseline approach. The graph is constructed by iteratively adding patterns to it, as shown in Algorithm 1. A node is added to the graph for each word (token) in the pattern, where consecutive words are linked with directed edges. When adding a new pattern, a token from the pattern is merged with an existing node in the graph provided that they have the same POS tag and share the same lemma. An essential observation is that some words, such as “he” and “his”, have the same POS tag “PRP” and the same lemma, but should not be merged together. Stopwords like “the”, “to”, and “of” should also not be merged, in order to avoid noise. It is important to clarify that, without annotation, the core of each pattern is simply a sentence of the original text. The structure of triples is used only to identify their predicates and arguments, to perform head argument annotation and triple similarity checking.

[Algorithm 1: fusion graph construction]

We enhance the fusion graph by merging triples without type annotation, taking into account entity linking and predicate similarity when adding nodes. In other words, a token can now be an entity or a predicate, and its linking or similarity is therefore taken into account during the merging process (the \(t \notin V\) check in lines 8–10 of Algorithm 1). We also employ strict merging, where merging is done only for matching entities and predicates, but not for other types of nodes. The idea is to avoid topic drift and the concatenation of incompatible triples. The example in Sect. 3 shows that fusing the two triples \(\langle \)Trump; told; Comey to stop his investigation about Russia\(\rangle \) and \(\langle \)Trump; fired; Comey because of his investigation about Hillary Clinton\(\rangle \) might lead to the sentence “Trump told Comey to stop his investigation about Hillary Clinton”, which is not correct.
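The following is a minimal sketch of fusion-graph construction with strict merging, using networkx. The Token container, its fields, and the merge-key scheme are illustrative assumptions layered on top of the entity and predicate linking of Sect. 4.2, not the authors’ data structures.

```python
# Minimal sketch of fusion-graph construction with strict merging, using
# networkx. The Token container and the merge-key scheme are illustrative
# assumptions on top of the entity/predicate linking of Sect. 4.2.
from dataclasses import dataclass
import networkx as nx

@dataclass(frozen=True)
class Token:
    surface: str
    entity_id: str = None      # id of the linked entity, if any (Sect. 4.2)
    predicate_id: str = None   # representative predicate, if any (Sect. 4.2)

def merge_key(token, seq_idx, pos):
    """Strict merging: only linked entities and linked predicates share nodes;
    every other token gets its own node, keyed by its sequence and position."""
    if token.entity_id:
        return ("ENT", token.entity_id)
    if token.predicate_id:
        return ("PRED", token.predicate_id)
    return ("TOK", seq_idx, pos)

def build_fusion_graph(token_sequences):
    """token_sequences: one list of Tokens per triple in the cluster."""
    g = nx.DiGraph()
    for i, tokens in enumerate(token_sequences):
        prev = None
        for j, tok in enumerate(tokens):
            node = merge_key(tok, i, j)
            g.add_node(node, surface=tok.surface)
            if prev is not None:
                g.add_edge(prev, node)   # consecutive words -> directed edge
            prev = node
    return g
```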

4.4 Summary Sentence Selection

Sentences that compose the final summary are selected from the fusion graph, where one path corresponds to one sentence. Paths are ranked based on two criteria: their grammaticality and their triple (or pattern) coverage. Highly ranked paths should cover many triples, which means that they summarize several facets of the same fact. Moreover, they should be grammatically correct.

We implemented our own grammatical checker based on Stanford NLP and languagetools.org (Footnote 1), since the model used by the baseline was not publicly available. We perform a partial grammatical fix, focusing on dangling verbs, i.e., verbs that are not correctly anchored to a subject, which result from either OLLIE or the graph fusion. The fix transforms the verb phrase into a well-formed clause using a relative pronoun (which, that, who, where, etc.) or a participle; we analyze the grammatical dependencies to detect the occurrence of dangling verbs and rely on entity typing to determine the correct pronoun. Additionally, we analyze whether a dangling verb should be in passive voice by checking whether there exists a preposition connected to the verb as a nominal modifier. Finally, sentences without verbs are discarded.
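A hedged sketch of such a grammaticality screen is given below, substituting the language_tool_python wrapper around LanguageTool and spaCy for the Stanford NLP plus languagetools.org setup described above; the three-way outcome and the library choices are our assumptions, and the actual dangling-verb repair is not reproduced here.

```python
# Hedged sketch of the grammaticality screen, using language_tool_python and
# spaCy as stand-ins for the Stanford NLP + languagetools.org setup above;
# the three-way outcome and the model names are assumptions.
import language_tool_python
import spacy

nlp = spacy.load("en_core_web_sm")
tool = language_tool_python.LanguageTool("en-US")

def grammar_status(sentence):
    """Return 'discard', 'grammatical', or 'needs_fix' for a candidate path."""
    doc = nlp(sentence)
    if not any(tok.pos_ in ("VERB", "AUX") for tok in doc):
        return "discard"          # sentences without verbs are discarded
    return "grammatical" if not tool.check(sentence) else "needs_fix"
```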

We further enhanced path ranking by exploiting, in addition to pattern coverage and grammaticality, node degrees. The degree of a node is the total number of its incoming and outgoing edges. The idea is to select paths that contain multiple important nodes, i.e., nodes with high degrees. Practically, our path ranking algorithm is a multi-step pairwise comparison in the following order: (1) pattern coverage, (2) node degree, and (3) grammaticality. For the node degree step, we compare first the average degree and then the total degree of two paths. Finally, leveraging our grammaticality checker and fixer, we give precedence in the following order: originally grammatical paths, grammatically fixable paths, and ungrammatical, non-fixable paths.
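One way to realize this multi-step comparison is as a lexicographic sort key, as in the sketch below; the Path container and its fields are hypothetical, and ties in earlier criteria fall through to the later ones exactly as in the ordering above.

```python
# Sketch of the multi-step path ranking as a lexicographic sort key; the Path
# container and its fields are hypothetical.
from dataclasses import dataclass

GRAMMAR_RANK = {"grammatical": 2, "fixable": 1, "unfixable": 0}

@dataclass
class Path:
    pattern_coverage: int   # number of patterns/triples covered by the path
    node_degrees: list      # degree of each node along the path
    grammar: str            # 'grammatical', 'fixable', or 'unfixable'

def ranking_key(path):
    avg_degree = sum(path.node_degrees) / len(path.node_degrees)
    total_degree = sum(path.node_degrees)
    # Compared left to right: coverage first, then average and total degree as
    # tiebreakers, then the grammaticality precedence described above.
    return (path.pattern_coverage, avg_degree, total_degree,
            GRAMMAR_RANK[path.grammar])

def best_path(paths):
    return max(paths, key=ranking_key)
```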

5 Experiments

5.1 Setup

Datasets. For our evaluation, we used the DUC’04 (Footnote 2) and DUC’07 (Footnote 3) datasets, which are among the most widely used English corpora for summarization. The DUC’04 dataset contains 50 news topics, while DUC’07 provides 45. Each news topic contains 10 news articles and 4 human summaries. We also prepared a dataset for the manual assessment of the quality of our summaries. The code of our work can be found at https://gitlab.inf.unibz.it/rprasojo/summarization.

Assessment. The results are assessed automatically by comparing the summaries generated by the approaches under comparison with the human summaries. For the manual assessment, performed on randomly selected summary sentences (100 per approach), we proceeded as follows. We asked 20 students and researchers in our faculty to independently assess the coherence and correctness of the summary sentences on a scale from 1 to 5, and we then computed the average score of each sentence. Correctness concerns whether the reported information corresponds to what really happened, whereas coherence concerns the well-formedness of the sentence structure.

Strategies Under Comparison. We used the approach by Li et al. [12] as the baseline for our experiments. This approach represents the starting point of our work. We performed further improvements and tested the impact of each extension on the results. So, we have the following strategies under comparison:

  1. B. The baseline approach by Li et al. [12];

  2. B+EL. The baseline approach with Entity Linking;

  3. B+PL. The baseline approach with Predicate Linking;

  4. B+EL+PL. The baseline approach with Entity and Predicate Linking;

  5. B+EL+PL-T. The baseline approach with Entity and Predicate Linking but without Typing Annotation;

  6. B+EL+PL-T+SM. The baseline approach with Entity and Predicate Linking, without Typing Annotation, and with Strict Merging;

  7. B+EL+PL+SM. The baseline approach with Entity and Predicate Linking and Strict Merging.

Metrics. We have used the following measures in our evaluation:

  1. ROUGE. The ROUGE measure [14] computes the overlap between automatically produced summaries and human-produced summaries, which are considered as ground truth. The overlap is typically measured in terms of n-grams, where n is defined by the experimental setting; in our work, we used n-grams of size 1 and 2. The ROUGE metric comprises two quantitative measures: recall and precision. Recall is the number of overlapping n-grams divided by the total number of n-grams in the human-produced summary, whereas precision is the number of overlapping n-grams divided by the total number of n-grams in the automatic summary. In our experiments, we report the F1 measure, which combines both precision and recall (a minimal n-gram overlap sketch is given after this list).

  2. PYRAMID. PYRAMID scoring [18] involves semantic matching of Summary Content Units (SCUs), so it can recognize semantically synonymous facts. We use the automated version proposed in [20], which leverages a weighted factorization model to transform the n-grams within sentence bounds of the generated summary, as well as the contributors and label of an SCU, into 100-dimensional vector representations. If the similarity between an n-gram vector of a summary and an SCU exceeds a given threshold, then the SCU is assigned to the summary. We use the same setting described in [3], including the two threshold values 0.6 and 0.65.
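For reference, the sketch below shows the ROUGE-n recall, precision, and F1 computation described in item 1, with naive whitespace tokenization; it is an illustration of the formula, not the official ROUGE toolkit [14] used in the experiments.

```python
# Illustration of ROUGE-n recall/precision/F1 with naive whitespace
# tokenization; not the official ROUGE toolkit [14] used in the experiments.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())             # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"recall": recall, "precision": precision, "f1": f1}
```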

5.2 Results

The overall results of our approach are shown in Tables 1 and 2. Table 1 reports the ROUGE scores for all strategies and datasets. We observe that our approach significantly improves the precision and F1 measure over the baseline approach. For the DUC’04 dataset we obtain an increase of \(7\%\) in precision and \(11\%\) in F1 measure for unigram matching (R-1). These values naturally decrease for bigram matching (R-2), but we still improve the precision and F1 measure by \(2\%\) and \(3\%\), respectively. The same observations hold for DUC’07, with very similar values for unigram matching. We notice that the improvement is slightly higher for bigram matching, where the increase in precision and F1 measure is no less than \(3\%\). Looking more closely at the results of the different strategies, we observe that all our extensions improve precision and F1 measure with respect to the baseline approach, each adding at least \(1\%\). We only note a very slight decrease in F1 measure for both datasets depending on whether typing annotations are used or not.

Table 1. ROUGE scores in %, where B = Baseline, EL = Entity Linking, PL = Predicate Linking, T = Typing, and SM = Strict Merging
Table 2. PYRAMID scores (DUC’07, in %)

We further computed the PYRAMID scores for the DUC’07 dataset, as shown in Table 2. DUC’07 is the only dataset providing a ground truth with the semantic annotations required for PYRAMID scoring; DUC’04 does not, so a PYRAMID evaluation for it is not possible. We observe that our approach provides a significant improvement over the baseline that goes up to \(25\%\).

Besides the ROUGE and PYRAMID scores, our evaluation shows how the connectedness of the graph evolves as a result of our improvements. Initially, the baseline relies on typing in the graph fusion, which causes highly liberal merging. For instance, if a cluster of patterns contains many <PERSON> annotations, even when they do not refer to the same person, then some of the paths will be long (i.e., have high pattern coverage), with many of them potentially resulting in an incorrect merging. By leveraging EL+PL, the typings are replaced with the corresponding entity and predicate annotations, which causes less liberal merging. Indeed, after applying EL+PL, the average pattern coverage score goes down from 6.32 to 3.29, with the standard deviation down from 3.02 to 0.87, so our graph becomes more compact and less convoluted. Combined with the ROUGE and PYRAMID scores, we can be confident that most of the paths that are “fixed” with respect to the baseline are bad paths. Less variance in the pattern coverage score also means more usage of the node degree ranking tiebreaker: we measured that its usage rises from 11% (baseline) to 80% (B+EL+PL+SM).

We also give an example of two summaries, one generated by the baseline approach and one by our approach. The summaries are of the news articles talking about Donald Trump firing the FBI director.

[Figures d and e: example summaries generated by the baseline approach and by our approach]

We observe that the summary produced by our approach has more correct sentences than the one produced by the baseline. More importantly, our approach shows a logical flow: the summary starts with the risks related to the Russian investigation, then moves smoothly to its consequence, the dismissal of the FBI director. It then covers the claimed motivation, namely the Clinton investigation, how Comey was notified, and how he felt. By contrast, the baseline summary touches on most of these issues but in an almost random order.

5.3 Manual Evaluation

We manually assessed the coherence and correctness of our summaries. These two metrics are best illustrated by the examples shown above. We can see that the baseline sentence “Comey was fired now at a time before he made Tuesday’s decision” is incorrect, and that the sentence “Donald Trump to leave office over the investigations into his administration’s links with Russian.” is incoherent. We ran our manual assessment on 200 randomly selected summary sentences, 100 for each approach (B and B+EL+PL+SM). Table 3 shows that our approach has slightly better coherence and substantially better correctness than the baseline.

Table 3. Manual evaluation result

The assessors show a high degree of agreement, with an average standard deviation of 0.87 per sentence. There are very few instances of polar differences, totaling 6 out of 200 sentences. The high coherence score for both approaches shows that the graph fusion is able to maintain coherence during the pattern merging process, even for the baseline. The assessors tend to penalize the coherence score when a sentence is less grammatical, which suggests that the partial grammatical fixer in our ranking model has some impact in increasing the coherence of the improved approach. On the other hand, the large increase in correctness suggests that the graph fusion is much more effective in correctly merging facts when leveraging entity and predicate linking rather than typing.

6 Conclusions and Lessons Learned

In this paper we have proposed a summarization technique based on semantic triples and graph models, starting from an existing baseline approach. We have introduced a series of improvements that help find the important facts mentioned in news articles together with their facets. We have shown that our linking techniques increase both the recall and the F-measure. This suggests that our entity and predicate linking are more effective in the graph fusion than the typing annotation used by the baseline. In the baseline, most entities and predicates were annotated only with their types, causing incorrect merging during the fusion step; our entity and predicate linking “replaces” this annotation, which helps fix the incorrect merging. Removing the typing entirely seems to further increase the recall. However, adding strict merging on top of the typing annotation produces the best recall at the expense of precision (and F1 measure). This suggests that, when entity and predicate linking are employed, merging based on typing annotation is still better in terms of recall than merging non-annotated tokens. In terms of PYRAMID scores, our approach produces more Summary Content Units than the baseline. This reinforces the ROUGE results, showing that our improvements help produce summaries with more informative content. The manual assessment also shows that our approach achieves high coherence and correctness scores.