
1 Introduction

The explosive growth of content on the Internet gives rise to the problems of extracting and automatically summarizing useful information from the incoming data stream. One such problem is the summarization of news articles about an event. A news story is a set of news reports from various sources dedicated to describing an event. Such problems are often investigated and solved by news aggregators, for example, Google.News [17] or Yandex.News. This is because working with such problems requires a huge and diverse collection of news articles.

The typical “lifetime” of a news story (the period of active discussion of the event) is usually a day or two, but not all events are so short-lived. Some news stories have a “history” in the form of a set of previous events that occurred at different moments and are more or less related to each other. Existing multi-document summarization approaches do not take into account the fact that the context, actors, geography and other properties of an event can vary over time.

The fact that journalists return to the same events, for example when new data appears, indicates that such events are important to society. The need for a brief summary of an event raises the problem of forming a “timeline summary”. A timeline summary is a type of multi-document summary containing the essential details of the subject under discussion. Constructing such annotations is a complex task, performed manually by journalists or analysts. This implies that automating the process is an urgent problem.

In this paper we consider challenges and solutions for the automatic generation of temporal summaries. We treat this problem as query-oriented multi-document summarization over a representative collection of news documents. The query in this case is the text of a news message. This corresponds to the scenario in which a user would like to receive a timeline summary after reading a news document. The result should be a time-ordered list of descriptions of the key sub-events related to the main event. Since our solution belongs to the extractive summarization family, the result consists of parts of existing sentences.

A system was developed to automate the timeline summarization process. Experiments were conducted over a collection of 2 million Russian news articles from the first half of 2015. Three new factors were investigated to improve the quality of the constructed timeline summaries: query extension using pseudo-relevance feedback, accounting for the temporal characteristics of news stories, and accounting for the inverted-pyramid structure of news articles.

This is a follow-up to the study of the timeline summarization problem reported in our previous paper [25]. In this study, we expanded the collection of reference annotations three-fold. The evaluation process was improved by dividing the collection into training and test parts. An optimization module was added for fitting the configurations. As a result, substantial progress was achieved: taking into account the structure of the inverted pyramid yielded a significant increase in metric values, which was not achieved in the previous article.

2 Related Work

2.1 Automatic Text Summarization Problem

Currently, quite a number of methods for automatic text summarization exist [3]. Some methods use large linguistic ontologies [12, 15], which may be automatically supplemented during the analysis. Other methods are based on the statistical properties of texts [16] or on machine learning [13].

The following problems occur during the generation of annotations [3, 7, 11]:

  • Ensuring the completeness of the presentation of information, including the most up-to-date information.

  • Reducing redundancy in the information provided.

  • Ensuring the coherence and understandability of the information provided.

To ensure the completeness of the resulting annotation, it is often necessary to find links between sentences or documents [20].

To measure redundancy in the generated annotations, various measures of similarity between sentences are used. One of the most common approaches is clustering, i.e., grouping sentences by content [6]. Another approach to reducing redundancy is to compare a candidate sentence with the sentences already included in the summary and to evaluate how much novel information it adds. An example of such an approach is Maximal Marginal Relevance (MMR) [2].
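As a minimal illustration of the MMR idea, the sketch below greedily selects sentences that are similar to the query but dissimilar to what was already selected. Sentences and the query are assumed to be pre-vectorized (e.g. as tf-idf vectors); the `cosine` helper, the parameter `lam`, and the toy vectors in the usage note are illustrative, not taken from [2].

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query, sentences, k, lam=0.7):
    """Greedily pick k sentence indices, balancing query relevance
    against redundancy with already-selected sentences."""
    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            rel = cosine(query, sentences[i])
            red = max((cosine(sentences[i], sentences[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

For instance, with `lam=0.3` the redundancy penalty dominates, so after picking a sentence close to the query the selector prefers a dissimilar one even if it is less relevant.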

The problem of ensuring the coherence of information in the summary arises both in annotation generation methods [18, 19] and in evaluation methods, because assessing the coherence and linguistic quality of an annotation requires manual evaluation.

2.2 Timeline Summary

The problem of timeline summary construction differs in several ways from the standard summarization problem. For example, the temporal nature of events must be taken into account [9]. Also, to ensure the completeness of the information provided, documents from all sub-events of the topic under consideration must be found.

When constructing a timeline summary, processing is mainly carried out over huge collections in which most of the information is not relevant to the user’s request. This problem can be addressed with clustering methods [10, 14], but these have their own issues. First, clustering must be performed many times over huge collections of documents, which affects the response time of the system. Second, for documents that describe related but far-apart-in-time events, standard similarity measures can yield significantly lower similarity values. Finally, the most characteristic objects must be identified [1, 9], for example, by taking into account the structural features of the document stream [5, 8].

3 Statement of the Problem

3.1 General Description

The problem of constructing a timeline summary is query-oriented. In the most general case, the user’s query is a news document, so in the following this problem is treated as the automatic creation of a summary for a query given in the form of a text document. The output of the system is an annotation of n sentences. Coherence between the sentences is not required in this paper. Figure 1 provides an example of a possible summary about a conflict at a cemetery, taken from the Interfax website.

Fig. 1. Timeline summary fragment about a conflict at a cemetery.

The aim of the work is to study the influence of various factors on the quality of the annotation.

3.2 Mathematical Statement of the Problem

The problem described above can be formalized in the following way. Let \( Q = \left\{ {q_{1} , q_{2} , \ldots , q_{m} } \right\} \) be a set of queries and \( D_{g} = \left\{ {D_{g}^{{q_{1} }} , D_{g}^{{q_{2} }} , \ldots , D_{g}^{{q_{m} }} } \right\} \) the associated set of reference annotations. The system generates a set of summaries \( D_{A} = \left\{ {D_{A}^{{q_{1} }} , D_{A}^{{q_{2} }} , \ldots , D_{A}^{{q_{m} }} } \right\} \) in response to the queries \( Q \) using algorithm \( A \). The problem is then reduced to maximizing the following functional:

$$ \frac{{\mathop \sum \nolimits_{i = 1}^{i = \left| Q \right|} M\left( {D_{A}^{{q_{i} }} , D_{g}^{{q_{i} }} } \right)}}{\left| Q \right|} \to max $$
(1)

where \( M \) is the proximity function between annotations. Optimization is carried out over all parameters of the algorithm.

4 Approach

4.1 Collection Processing

As mentioned earlier, the input collection contains 2 million news articles. It is not feasible to work with such an amount of information directly, so it was decided to interact with the collection through a search engine. The search engine makes it possible to:

  • Retrieve a list of documents for a text query.

  • For a given document from the collection, obtain its basic information: text, index, meta-information.

4.2 Studied Features

In this paper the following factors were investigated:

  • Query extension strategy.

  • Accounting for the temporal nature of news stories.

  • Accounting for the structure of a news article in the form of an inverted pyramid.

4.3 Query Extension Strategy

The information that can be obtained from the query document alone is generally insufficient for effectively building this type of annotation, because most news articles are not a general description of an event but a discussion of some particular incident or fact.

To overcome this problem, query extension techniques are needed. The developed algorithm uses the idea of pseudo-relevance feedback, which is widely used in information retrieval [21]. For a query document, the algorithm performs the following steps:

  1. The most significant \( K \) terms are chosen on the basis of tf-idf weights, forming the first-level query.

  2. Documents are then retrieved using the first-level query.

  3. The retrieved cluster of documents is analyzed to find the most important terms, forming the second-level query:

    a. For each document, the most significant \( T \) terms are considered.

    b. For each term, it is counted how often it appears among the top \( T \) terms across the cluster.

    c. The term list is sorted by this frequency, and the best \( M \) terms are selected.

  4. Steps 2–3 are repeated (a double query extension, forming a third-level query).

  5. The output of the algorithm is a vector of \( N \) terms representing, to some extent, the semantics of the input document.
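The expansion loop above can be sketched in a few lines. The search engine is abstracted as a callable, documents are represented as term-to-tf-idf dictionaries, and `top_terms` stands in for the tf-idf ranking; all of these interfaces are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter

def top_terms(doc_tf_idf, n):
    """Return the n terms of a document with the highest tf-idf weight."""
    return [t for t, _ in sorted(doc_tf_idf.items(),
                                 key=lambda kv: kv[1], reverse=True)[:n]]

def expand_query(query_doc, search, K=10, T=10, M=10, levels=2):
    """Pseudo-relevance-feedback expansion: levels=2 performs the
    double query extension (steps 2-3 repeated)."""
    query = top_terms(query_doc, K)          # step 1: first-level query
    for _ in range(levels):
        cluster = search(query)              # step 2: retrieve by current query
        counts = Counter()
        for doc in cluster:                  # step 3: vote for each doc's top T
            counts.update(top_terms(doc, T))
        query = [t for t, _ in counts.most_common(M)]  # best M by frequency
    return query                             # final expanded query
```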

Note that \( K, T, M, N \) are parameters of the algorithm and must be configured. As an example of the query extension module at work, consider the algorithm steps on a news article about the terrorist attack in Paris (Table 1).

Table 1. Query extension algorithm stages example.

The table shows that the higher-level queries contain more terms that are significant for this event.

4.4 Temporal Nature of News Stories

Since any event unfolds in time, both the content of publications and their number depend on time. As an example, Fig. 2 shows the number of publications over time for the “Earthquake in Nepal” event.

Fig. 2. Number of publications per day for an event.

To take this factor into account, the following procedure is applied to the set \( D \) of retrieved documents:

  1. The entire timeline of the event is divided into days with labels \( T = \left\{ {t_{1} , t_{2} , \ldots , t_{n} } \right\} \).

  2. Each document \( D_{i}^{t} \) receives a label from \( T \) based on its publication date.

  3. Documents published on days with fewer publications than the threshold \( NDoc_{tr} \) (2) are discarded.

    $$ NDoc_{tr} = 0.2*MEAN_{top\,3} \left( D \right) $$
    (2)

  4. The output is a sorted list of collections \( C = \left\{ {C_{{t_{1} }} , C_{{t_{2} }} , \ldots , C_{{t_{n} }} } \right\} \), where each collection \( C_{{t_{i} }} \) contains only the documents with the label \( t_{i} \).
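The day-level filtering above can be sketched as follows. The input format of `(day_label, document)` pairs is an assumption for illustration; the threshold follows (2), i.e. 0.2 times the mean size of the three busiest days.

```python
from collections import defaultdict

def temporal_split(docs_with_dates):
    """docs_with_dates: iterable of (day_label, document) pairs.
    Returns a day-sorted list of (day, documents) collections, with
    quiet days (below the NDoc_tr threshold) discarded."""
    by_day = defaultdict(list)
    for day, doc in docs_with_dates:         # step 2: label documents by day
        by_day[day].append(doc)
    top3 = sorted((len(v) for v in by_day.values()), reverse=True)[:3]
    ndoc_tr = 0.2 * sum(top3) / len(top3)    # threshold (2)
    return [(day, by_day[day])               # step 4: sorted collections
            for day in sorted(by_day)
            if len(by_day[day]) >= ndoc_tr]  # step 3: drop quiet days
```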

4.5 Inverted Pyramid

High-quality news articles are often written following the “inverted pyramid” structure (Fig. 3). The upper and lower parts of the pyramid are of the greatest interest:

Fig. 3. Inverted pyramid on the example of an article (https://themoscowtimes.com/articles/moscow-museum-takes-you-inside-north-korea-60240).

  • The upper part contains the most concentrated information about the event under discussion.

  • The lower part may contain references to important related events in the past.

This structure is taken into account in two ways:

  1. An inter-document feature based on a graph approach.

  2. An intra-document feature, which increases the weights of sentences located in the upper and lower parts of the inverted pyramid.

Inter-Document Feature.

This feature is taken into account in the following way:

  1. For a set of documents \( D \), a similarity matrix between the upper and lower parts of the documents is constructed. If the similarity exceeds a specified threshold, a link is considered to exist between documents \( D_{i} \) and \( D_{j} \).

  2. The importance of the documents is calculated with the LexRank algorithm over the constructed graph [4].

  3. For documents whose weight is greater than a certain threshold, the query extension procedure described earlier is performed.

As a result, the output is a ranked list of documents \( D \) and a set \( Q_{D} \) of new queries which, together with accounting for the temporal nature of the news story, will aid the sentence ranking algorithm. Document weights are also taken into account in the ranking functions.
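The graph-based scoring above can be sketched as follows. The paper uses LexRank [4]; this self-contained stand-in scores documents with a plain degree-normalized power iteration over the thresholded link graph, which captures the same centrality idea. The similarity matrix, threshold, and damping value are illustrative assumptions.

```python
def doc_importance(sim, threshold=0.3, d=0.85, iters=50):
    """sim: n x n similarity matrix between document upper/lower parts.
    Returns one centrality weight per document."""
    n = len(sim)
    # Step 1: link documents whose part similarity exceeds the threshold.
    adj = [[1.0 if i != j and sim[i][j] >= threshold else 0.0
            for j in range(n)] for i in range(n)]
    # Step 2: LexRank-style power iteration over the link graph.
    w = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = sum(w[j] * adj[j][i] / max(sum(adj[j]), 1.0)
                       for j in range(n))
            new.append((1 - d) / n + d * rank)
        w = new
    return w
```

A document linked to many others (e.g. a hub article that later reports refer back to) receives a higher weight than its peripheral neighbors.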

Intra-Document Feature.

To take this feature into account, the following procedure is applied: during sentence ranking, the weight of a sentence is multiplied by a coefficient that lowers the weights of sentences in the middle of the document.

In addition, after the inter-document procedure described above, all constructed extended queries \( Q_{D} \) are mapped to the labels \( t_{\text{i}} \) from \( T \) (Fig. 4).

Fig. 4. Query mapping.

4.6 Similarity of Sentences

At various stages of the algorithm, a measure of closeness between sentences must be calculated. For this purpose, the cosine similarity measure (3) is used in all cases.

$$ Sim_{cos} \left( {S_{i} , S_{j} } \right) = \frac{{\left( {S_{i} , S_{j} } \right)}}{{\left| {S_{i} } \right|*\left| {S_{j} } \right|}} $$
(3)

The choice of sentence representation plays an important role in calculating similarity. In this article we use the standard tf-idf representation, but for calculating the similarity between sentences when searching for links between documents, a word2vec [24] representation is used. To achieve this, the sentence vector is computed as a weighted mean of the word2vec word vectors, with tf-idf values used as weights.

The word2vec model was trained on the entire collection of 2 million news articles. During preprocessing, stop-word removal and lemmatization were applied. The window width was set to 5 and the vector length to 100.
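The weighted-mean sentence representation can be sketched as below. The function signature, the idf dictionary, and the toy 2-d vectors in the usage note are assumptions for illustration; the paper trains 100-d word2vec vectors on the full collection.

```python
def sentence_vector(tokens, word_vecs, idf, dim):
    """tf-idf-weighted mean of per-word embeddings for one sentence."""
    acc, total = [0.0] * dim, 0.0
    for tok in set(tokens):
        if tok not in word_vecs:
            continue                          # skip out-of-vocabulary words
        w = tokens.count(tok) / len(tokens) * idf.get(tok, 1.0)  # tf * idf
        acc = [a + w * v for a, v in zip(acc, word_vecs[tok])]
        total += w
    return [a / total for a in acc] if total else acc
```

For example, with unit idf and the vectors {'a': [1, 0], 'b': [0, 1]}, the sentence ['a', 'b'] maps to [0.5, 0.5].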

4.7 Sentence Ranking Module

This module ranks sentences using a modified version of the \( MMR \) algorithm, \( MMRT \) (4), which takes into account all the factors described in Sect. 4.2:

$$ MMRT_{{s_{i}^{t} }} = INC_{{s_{i}^{t} }} - DEC_{{s_{i}^{t} }} $$
(4)

where \( INC_{{s_{i}^{t} }} \) is the positive term of the formula, which depends on the similarity of the sentence to the query, the weight of the document from which the sentence is taken, and the position of the sentence in the document:

$$ INC_{{s_{i}^{t} }} = \left( {1 + \alpha *I_{i} } \right)* \gamma * \lambda *Sim\left( {Q^{t} , S_{i}^{t} } \right) $$
(5)
$$ \gamma = 1 - 0.5*{ \sin }\left( {\frac{i* \pi }{{\left| {D_{s} } \right|}}} \right) $$
(6)

The parameters \( \alpha \) and \( \lambda \) are configurable parameters of the algorithm; \( I_{i} \) is the weight of the document \( D_{s} \) that contains the sentence with index \( i \); \( S_{i}^{t} \) is the evaluated sentence with index \( i \) and label \( t \); \( Q^{t} \) is the query for this time label; and \( \gamma \) is a multiplier that reduces the weight of sentences from the middle of the document.

\( DEC_{{s_{i}^{t} }} \) is a penalty term that depends on the similarity to the already extracted sentences:

$$ DEC_{{s_{i}^{t} }} = \left( {1 - \lambda } \right)*\mathop {\hbox{max} }\nolimits_{{S_{j} \in S}} Sim(S_{j} , S_{i}^{t} ) $$
(7)

where \( S_{j} \) is one of the extracted sentences and \( S \) is the set of all already extracted sentences.

Sentences are processed in chronological order, with a restriction on the maximum number of sentences per day.
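Formulas (4)–(7) combine into a single score per candidate sentence, as in the sketch below. Parameter names follow Sect. 4.7; the similarity function is pluggable (cosine in the paper), and the default values of \( \alpha \) and \( \lambda \) here are illustrative assumptions rather than the fitted values.

```python
from math import sin, pi

def mmrt_score(i, sent_vec, query_vec, doc_weight, n_doc_sents,
               chosen, sim, alpha=0.5, lam=0.7):
    """MMRT (4): INC term (5) with position multiplier gamma (6),
    minus the redundancy penalty DEC (7).
    i: sentence index in its document; chosen: already-extracted sentences."""
    gamma = 1 - 0.5 * sin(i * pi / n_doc_sents)   # lowest for mid-document
    inc = (1 + alpha * doc_weight) * gamma * lam * sim(query_vec, sent_vec)
    dec = (1 - lam) * max((sim(s, sent_vec) for s in chosen), default=0.0)
    return inc - dec
```

Note how \( \gamma \) equals 1 at the first sentence and drops to 0.5 in the middle of the document, implementing the inverted-pyramid preference for leading and trailing sentences.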

4.8 System Diagram

The features described in Subsect. 4.2 are realized at various stages of the system. The general scheme of the algorithm is shown in Fig. 5.

Fig. 5. Working scheme.

5 Evaluation

5.1 Metrics for Evaluation

The system was evaluated using several metrics: ROUGE-1, ROUGE-2, and sentence recall \( R^{sent} \):

$$ ROUGE - N = \frac{{\left| {N_{A} \cap N_{g} } \right|}}{{\left| {N_{g} } \right|}} $$
(8)

where \( N_{A} \) is the set of n-grams for the constructed annotations, \( N_{g} \) is the set of n-grams for the reference (gold) annotations.

$$ R^{sent} = \frac{{\left| {S_{A} \equiv S_{g} } \right|}}{{\left| {S_{g} } \right|}}, $$
(9)

where \( S_{A} \) is the set of sentences from the constructed annotations and \( S_{g} \) is the set of sentences from the reference annotations. The operator ≡ selects the subset of sentences of \( S_{A} \) that have a semantic equivalent in \( S_{g} \); \( \left| {S_{A} \equiv S_{g} } \right| \) denotes the size of this subset.
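The ROUGE-N recall (8) can be computed as a set-based n-gram overlap, as sketched below; plain whitespace tokenization is an assumption for illustration. Sentence recall (9) is not sketched because judging semantic equivalence of sentences requires manual assessment.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n(generated, reference, n):
    """|N_A intersect N_g| / |N_g| over the two annotations' n-gram sets."""
    na = ngrams(generated.split(), n)
    ng = ngrams(reference.split(), n)
    return len(na & ng) / len(ng) if ng else 0.0
```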

5.2 Data Preparation

Since a test set of annotations is required for the evaluation procedure, timeline summaries were manually prepared in the course of the research. The collection was formed as follows:

  1. At the first stage, high-profile events of early 2015 that were actively covered in the press were selected with the help of Wikipedia.

  2. Then, for most of the events, the corresponding story was searched for on the Interfax website, and a timeline summary was created on the basis of the documents belonging to that story.

  3. If no corresponding story existed on Interfax, materials on the topic were studied and a timeline summary was created on the basis of the documents read.

As a result, 45 annotations on 15 news stories were created (Table 2).

Table 2. News stories on which the reference annotations are made.

5.3 Optimization of Algorithm Parameters

Since the system contains a large number of parameters (23 in total), some of which are presented in Table 3, it was necessary to optimize the choice of their values.

Table 3. Some system parameters.

To achieve this, the entire collection of reference annotations was divided into training and test parts in the ratio 2:1. The functional (1) was then optimized in Python using the open-source hyperopt package [22], which applies Sequential Model-Based Optimization (SMBO) [23] for parameter selection. The parameters were fitted on the training part, after which the final evaluation of the configurations took place on the test part.
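The fitting protocol can be sketched as follows. The actual system uses hyperopt's SMBO/TPE; to keep the sketch self-contained, plain random search stands in for it here, and the objective and parameter ranges are illustrative assumptions. The real objective is the mean proximity \( M(D_A^{q_i}, D_g^{q_i}) \) over the training queries, as in (1).

```python
import random

def random_search(objective, space, n_trials=100, seed=0):
    """space: dict name -> (low, high). Maximizes objective(params)
    over uniformly sampled parameter settings (stand-in for SMBO/TPE)."""
    rng = random.Random(seed)
    best_params, best_val = None, float("-inf")
    for _ in range(n_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        val = objective(params)          # e.g. mean M(D_A, D_g) on train part
        if val > best_val:
            best_params, best_val = params, val
    return best_params, best_val
```

After fitting on the training queries, the best parameter setting is evaluated once on the held-out test queries.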

6 Results

To evaluate the contribution of the considered features, the following 6 configurations were fitted and evaluated:

  1. baseline – a simple summarization approach that does not take the considered factors into account and uses MMR as the ranking algorithm.

  2. querry-ex – baseline plus the query extension strategy (Sect. 4.3), but without double query extension.

  3. double-ex – querry-ex plus double query extension (Sect. 4.3).

  4. temporal – double-ex plus accounting for the temporal nature of news stories (Sect. 4.4).

  5. importance – temporal plus accounting for the inverted-pyramid structure of news articles, using the tf-idf representation (Sect. 4.5).

  6. w2v-imp – importance, but using word2vec to compute sentence similarity when accounting for the article structure (Sect. 4.6).

The results of evaluating the configurations are given in Table 4, which shows that each of the considered features makes a positive contribution to the quality of the generated timeline summaries. As an example of a final annotation, a fragment of the summary for the plane crash in Taiwan is given in Table 5.

Table 4. Evaluation results.
Table 5. The generated timeline summary fragment about the plane crash in Taiwan.

7 Conclusions and Future Work

In this article we presented an approach to building a timeline summary. The conducted research shows that the problem of constructing a timeline summary differs from the standard MDS problem. The effectiveness of the following features was shown:

  • Query extension strategy.

  • Accounting for the temporal nature of news stories.

  • Accounting for the structure of a news article in the form of an inverted pyramid.

Extending the query, as expected, has a positive effect on the representation of the event discussed in the document. An interesting fact is that re-extending the query (double query extension) has a much greater effect: the documents retrieved with the first-level query are not sufficient for a good representation of the event.

The fact that accounting for the temporal nature of news stories improves the quality of the annotation is a natural consequence of the temporal character of news stories and events.

Taking the structure of the inverted pyramid into account also gives an improvement. The increase in metric values for the w2v-imp configuration means that the correctness of the recognized links between documents plays a significant role. This fact raises challenges for future research.

Using the structural features of news articles makes it possible to obtain information whose use can significantly improve the quality of the generated annotations.