Answering Event-Related Questions over Long-Term News Article Archives
Abstract
Long-term news article archives are valuable resources about our past, allowing people to access detailed information about events that occurred at specific time points. To make better use of such heritage collections, this work considers the task of large scale question answering on long-term news article archives. Questions on such archives are often event-related. In addition, they usually exhibit strong temporal aspects and can be roughly categorized into two types: (1) questions containing explicit temporal expressions, and (2) questions only implicitly associated with particular time periods. We focus on the latter type, as such questions are more difficult to answer, and we propose a retriever-reader model with an additional module that reranks articles by exploiting temporal information from different angles. Experimental results on a carefully constructed test set show that our model outperforms existing question answering systems, thanks to the additional module that finds more relevant documents.
Keywords
News archives · Question answering · Information retrieval
1 Introduction
Table 1. Examples of questions in our test set, their answers, and dates of their events
Questions | Answers | Event dates |
---|---|---|
Which party, led by Buthelezi, threatened to boycott the South African elections? | Inkatha Freedom Party | 1993.08 |
What bill was signed by Clinton for firearms purchases? | Brady Bill | 1993.11 |
Which federal prosecutor led the investigation of the leak of the identity of Valerie Plame? | Patrick J. Fitzgerald | 2003.11 |
Riot in Los Angeles occurred because of the acquittal of how many officers in police department? | Four | 1992.04 |
Which American professional pitcher died because his small airplane crashed in New York? | Cory Lidle | 2006.10 |
This paper presents a large scale question answering system called QANA (Question Answering in News Archives), designed specifically for answering event-related questions on news article archives. It exploits the temporal information of a question, of a document's content, and of the document's timestamp for reranking candidate documents. In the experiments, we use the New York Times (NYT) archive as the underlying knowledge source and a carefully constructed test set of questions associated with past events. The questions are selected from existing datasets and history quiz websites, and they lack any temporal expressions, which makes them particularly difficult to answer. Experimental results show that our proposed system improves retrieval effectiveness and outperforms existing QA systems commonly used for large scale question answering.
We make the following contributions: (a) we propose a new subtask of QA, which uses long-term news archives as the data source, (b) we build effective models for solving this task by exploiting the temporal characteristics of both questions and documents, (c) we perform experiments to demonstrate their effectiveness and construct a novel dedicated test set for evaluating QA on news archives.
The remainder of this paper is structured as follows. The next section overviews the related work. In Sect. 3, we introduce our model. Section 4 describes experimental settings and results. Finally, we conclude the paper in Sect. 5.
2 Related Work
Question Answering System. Current large scale question answering systems usually consist of two modules: (1) an IR module (also called a document retriever) responsible for selecting relevant articles from an underlying corpus, and (2) a Machine Reading Comprehension (MRC) module (also called a document reader) used to extract answer spans from relevant articles, typically by using neural network models.
The latest MRC models, especially those that use BERT [3], can even surpass human-level performance (measured by EM (Exact Match) and F1 scores) on both SQuAD 1.1 [4] and SQuAD 2.0 [5], the two most widely-used MRC datasets, where each question is paired with a given reading passage. However, recent studies [6, 7, 8] indicate that the IR module is a bottleneck with a significant impact on the performance of the whole system (the MRC component degrades due to noisy input). Hence, a few works have tried to improve the IR task. Chen et al. [9] propose DrQA, one of the most well-known large scale question answering systems, whose IR component is a TF-IDF retriever that uses bigrams with TF-IDF matching. Wang et al. [7] introduce the \(R^{3}\) model, where the IR component and the MRC component are trained jointly by reinforcement learning. Ni et al. [10] propose the ET-RR model, which improves the IR part by identifying essential terms of a question and reformulating the query.
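To make the retriever side concrete, the following is a minimal sketch of a DrQA-style TF-IDF retriever; it uses scikit-learn's n-gram TF-IDF and cosine similarity as a stand-in for DrQA's hashed bigram implementation, so it illustrates the idea rather than reproducing the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "President Clinton signed the Brady Bill, imposing a waiting period on handgun purchases.",
    "The riots followed the acquittal of four Los Angeles police officers.",
]

# Unigram+bigram TF-IDF index over the corpus, queried with the raw question.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)

def retrieve(question: str, k: int = 5):
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_matrix).ravel()
    return scores.argsort()[::-1][:k]   # indices of the top-k documents

print(retrieve("What bill was signed by Clinton for firearms purchases?"))
```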
Nonetheless, as existing question answering systems are essentially designed for synchronic document collections (e.g., Wikipedia), they are incapable of utilizing temporal information such as document timestamps when answering questions on long-term news article archives, even though temporal information constitutes an important feature of events reported in news articles. Questions and documents are then processed in the same way as on synchronic collections. Although some temporal question answering systems that can exploit the temporal information of questions and document content have been proposed in the past [11, 12], they are still designed for synchronic document collections (e.g., Wikipedia or the Web) and do not use document timestamps. Besides, they rely on traditional rule-based methods and their performance is rather poor.
In addition, there are very few resources available for temporal question answering. Jia et al. [13] propose a dataset with 1,271 temporal question-answer pairs, 209 of which lack any explicit temporal expression. However, only a few pairs can be used in our case, as most are about events which happened a long time ago (e.g., the Viking invasion of England) or are not event-related.
Our approach contains an additional module for reranking documents, which improves the retrieval of correct documents by exploiting temporal information from different angles. We not only utilize the time scope information inferred from the questions themselves, but also combine it with document timestamp information and with temporal information embedded inside document content. To the best of our knowledge, no studies or datasets that could help to design a question answering system on news article archives have been proposed so far. Building a system that makes full use of past news articles and satisfies different user information needs is, however, of great importance given the continuously growing document archives.
Fig. 1. The architecture of QANA system
Existing temporal information retrieval methods (see the surveys [15, 16]) are designed for short queries instead of questions, and none of them exploits both timestamps and content temporal information. We are the first to adapt and improve concepts from temporal information retrieval for the QA research domain, showing significant improvement in answering questions on long-term news archives.
3 Methodology
In this section, we present the proposed system, which is designed specifically for answering questions over news archives. We focus on questions for which the time periods are not given explicitly, so that further knowledge is required for obtaining or inferring their time periods (e.g., “Who replaced Goss as the director of the Central Intelligence Agency?”). Figure 1 shows the architecture of the QANA system, which is composed of three modules: the Document Retriever Module, the Time-Aware Reranking Module, and the Document Reader Module. Compared with the architectures of other common large scale question answering systems, we add an additional component: the Time-Aware Reranking Module, which exploits temporal information from different angles for selecting the best documents.
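As an illustration only (module names and interfaces are ours, not taken from the paper's code), the overall flow of Fig. 1 could be wired up as follows:

```python
from typing import List

class Candidate:
    def __init__(self, doc_id: str, text: str, timestamp: str, rel_score: float):
        self.doc_id = doc_id          # document identifier
        self.text = text              # article body
        self.timestamp = timestamp    # publication month, e.g. "1998-01"
        self.rel_score = rel_score    # textual relevance from the retriever

def answer_question(question, retriever, reranker, reader, k=100, n=5) -> str:
    """Retrieve candidates, rerank them with temporal evidence, read the top n."""
    candidates: List[Candidate] = retriever.retrieve(question, k)        # Document Retriever Module
    reranked = reranker.rerank(question, candidates)                     # Time-Aware Reranking Module
    answers = [reader.extract(question, c.text) for c in reranked[:n]]   # Document Reader Module
    return max(set(answers), key=answers.count)   # most common answer as the final answer
```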
3.1 Document Retriever Module
Fig. 2. Burst detection results of two questions (Color figure online)
3.2 Time-Aware Reranking Module
In this module, temporal information is exploited from different angles to rerank the retrieved candidate documents. Since the time scope information of questions is not provided explicitly, the module first determines candidate periods of the time scope T(Q) of a question Q. These represent the periods in which an event mentioned in the question could have occurred. Each inferred candidate period is assigned a weight to indicate its importance. Then, the module contrasts the query time scope against the information derived from the document timestamp \(t_{pub}(d)\) and the temporal information embedded inside the document content \(T_{text}(d)\), in order to compute two temporal scores, \(S_{pub}^{temp}(d)\) and \(S_{text}^{temp}(d)\), for each candidate document d. Finally, both the textual relevance score \(S^{rel}(d)\) and the final temporal score \(S^{temp}(d)\) are used for document reranking.
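The excerpt introduces \(S_{pub}^{temp}(d)\) but does not reproduce its formula, so the following is only one plausible instantiation: a document published inside a candidate period receives that period's full weight, and otherwise the weight decays with the distance (in months) to the nearest period border. The decay rate is an assumption.

```python
import math

def month_index(ym: str) -> int:
    """Map 'YYYY-MM' to a linear month count, e.g. '1998-01' -> 23977."""
    y, m = ym.split("-")
    return int(y) * 12 + int(m) - 1

def s_pub_temp(t_pub: str, T_Q, W, decay: float = 0.1) -> float:
    """Hypothetical timestamp-based temporal score of a document."""
    p = month_index(t_pub)
    score = 0.0
    for (start, end), w in zip(T_Q, W):
        if month_index(start) <= p <= month_index(end):
            dist = 0                  # published inside the candidate period
        else:
            dist = min(abs(p - month_index(start)), abs(p - month_index(end)))
        score = max(score, w * math.exp(-decay * dist))
    return score
```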
Query Time Scope Estimation. Although the time scope information of the questions is not given explicitly, the distribution of relevant documents over time should provide information regarding the temporal characteristics of the questions. Examining the timeline of a query's result set should allow us to characterize how temporally dependent the topic is. For example, in Fig. 2, the dashed lines show the distribution per month of relevant documents obtained from the NYT archive for two example questions: “Lewinsky told whom about her relationship with President Clinton?” and “Which Hollywood star became governor of California?”. We use a cross mark to indicate the time of each corresponding event, which is also the true time scope of the question.
We can see that the actual time scope (January 1998) of the first question is reflected relatively well by its distribution of relevant documents, as these documents are generally located between 1998 and 1999. However, most of the relevant documents are published in October rather than January, because another event, the impeachment of Bill Clinton, occurred at that time. On the other hand, the distribution of relevant documents corresponding to the second question is more complex: it contains many peaks, the documents are not located in a specific short time period, and the number of relevant documents published around the actual event time is relatively small compared to the total number of relevant documents. Nevertheless, the distribution near the actual time of the event (November 2003) still reveals useful features, i.e., the highest peak (maximum) of the dashed line is near the event time. Therefore, the characteristics of the distribution of relevant documents over time can be used for inferring the hidden time scopes of questions.
We perform burst detection on the retrieved relevant time-aligned documents, as the time and the duration of bursts are likely to signify the start and end points of the events underlying the questions. More specifically, we apply the burst detection method used by Vlachos et al. [25], which is a simple yet effective approach.^3 Bursts are detected as points with values higher than \(\beta \) standard deviations above the mean value of the moving average (MA). The procedure for calculating the candidate periods of the time scope T(Q) of question Q is as follows:
\(T_{pub}(Q)\) can be easily obtained by collecting the timestamp information of each retrieved candidate document. T(Q) is a list of tuples \((t^{s}_{i}, t^{e}_{i})\), which are the two border time points of the ith estimated time period. There are two parameters in our burst detection: w and \(\beta \). For simplicity, the moving average \(MA_{w}\) of \(T_{pub}(Q)\) of each question is calculated using w equal to 4, corresponding to four months. Following [25], which uses typical values of \(\beta \) between 1.5 and 2.0, we use 2.0 in the experiments. In Fig. 2, the red solid lines show the bursts of the two previously mentioned example questions. The inferred time scope of the first question is [(‘1998-03’, ‘1999-05’)], while the time scope of the second question contains three periods: [(‘2003-08’, ‘2004-02’), (‘2004-06’, ‘2004-06’), (‘2004-09’, ‘2004-10’)]. Note that the second period of the second time scope is actually a single time point (shown as a single small red point in the graph).
After calculating T(Q), each candidate period is assigned a weight indicating its importance, obtained by dividing the number of documents published within the period by the total number of documents published in all the candidate periods of the time scope T(Q). For example, for the second example question, the number of documents published within the period (‘2003-08’, ‘2004-02’) is 43, while the total number of documents published within all the periods of T(Q) is 55, so the weight assigned to this period is \(\frac{43}{55}\). We use W(T(Q)) to represent the weight list, such that \(W(T(Q))=[w(t^{s}_{1},t^{e}_{1}),...,w(t^{s}_{m},t^{e}_{m})]\), where m is the number of candidate periods of the time scope T(Q).
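A compact sketch of the burst detection and period weighting described above, following the thresholding idea of Vlachos et al. [25]; the grouping of consecutive bursty months into periods is our reading of the procedure, not code from the paper.

```python
import numpy as np

def candidate_periods(months, counts, w=4, beta=2.0):
    """months: ['1998-01', ...]; counts: relevant documents per month.
    A month is bursty if its count exceeds the mean of the w-month
    moving average MA_w by beta standard deviations."""
    counts = np.asarray(counts, dtype=float)
    ma = np.convolve(counts, np.ones(w) / w, mode="same")
    bursty = counts > ma.mean() + beta * ma.std()

    periods, i = [], 0                 # (start_idx, end_idx) of consecutive bursty months
    while i < len(months):
        if bursty[i]:
            j = i
            while j + 1 < len(months) and bursty[j + 1]:
                j += 1
            periods.append((i, j))
            i = j + 1
        else:
            i += 1

    # T(Q): list of (t_s, t_e); W(T(Q)): each period's share of the burst documents
    totals = [counts[s:e + 1].sum() for s, e in periods]
    grand = sum(totals) or 1.0
    T_Q = [(months[s], months[e]) for s, e in periods]
    W = [t / grand for t in totals]
    return T_Q, W
```

For the second example question, this weighting yields 43/55 for the first period, matching the worked example above.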
Content-Based Temporal Score Calculation. Next, we compute another temporal score, \(S_{text}^{temp}(d)\), of a candidate document d based on the relation between the temporal information embedded in d's content and the candidate periods of the time scope T(Q). We compute \(S_{text}^{temp}(d)\) because some news articles, even ones published long after the events mentioned in the questions, may retrospectively refer to these events, providing salient information about them, and can thus help to distinguish between similar events. For example, articles published near a certain US presidential election may also discuss previous elections for comparison or for other purposes. Such references often take the form of temporal expressions that refer to particular points in the past.
Temporal expressions are detected and normalized by combining a temporal tagger (we use SUTime [29]) with temporal signals^4 (words that help to identify temporal relations, e.g., “before”, “after”, “during”). The normalized result of each temporal expression is mapped to a time interval with “start” and “end” information. For example, the temporal expression “between 1999 and 2002” is normalized to [(‘1999-01’, ‘2002-12’)]. Special cases like “until January 1992” are normalized as [(‘’, ‘1992-01’)], since the “start” information cannot be determined. Finally, we obtain a list of time scopes of the temporal expressions contained in a document d, denoted as \(T_{text}(d)=\{\tau _1, \tau _2, ... ,\tau _{m(d)}\}\), where m(d) is the total number of temporal expressions found in d.
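The exact definition of \(S_{text}^{temp}(d)\) is likewise not in this excerpt; one simple instantiation, sketched below, credits each temporal expression in proportion to its overlap with each weighted candidate period. Open-ended intervals such as [(‘’, ‘1992-01’)] would need clamping to the archive span first.

```python
def month_index(ym: str) -> int:
    y, m = ym.split("-")
    return int(y) * 12 + int(m) - 1

def months_overlap(a, b) -> int:
    """Number of months shared by two ('YYYY-MM', 'YYYY-MM') intervals."""
    lo = max(month_index(a[0]), month_index(b[0]))
    hi = min(month_index(a[1]), month_index(b[1]))
    return max(0, hi - lo + 1)

def s_text_temp(T_text_d, T_Q, W) -> float:
    """Hypothetical content-based temporal score of a document."""
    score = 0.0
    for tau in T_text_d:               # intervals from SUTime + temporal signals
        length = month_index(tau[1]) - month_index(tau[0]) + 1
        for period, w in zip(T_Q, W):
            score += w * months_overlap(tau, period) / length
    return score
```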
3.3 Document Reader Module
For this module, we utilize a commonly used MRC model called BiDAF [30], which achieves an Exact Match score of 68.0 and an F1 score of 77.5 on the SQuAD 1.1 dev set. We use the BiDAF model to extract answers from the top N reranked documents and select the most common answer as the final answer. Note that BiDAF could be replaced by other MRC models, for example, models that incorporate BERT [3]. We use BiDAF for easy comparison with DrQA, whose reader component performs similarly to, though slightly better than, BiDAF.
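The answer aggregation step is straightforward; below is a sketch of the majority vote over the reader's top-N answer spans (the string normalization is our own addition, not described in the paper).

```python
from collections import Counter

def aggregate(spans):
    """Return the most common answer span among the top-N reader outputs."""
    normalized = [s.strip().lower() for s in spans]   # naive normalization
    answer, _ = Counter(normalized).most_common(1)[0]
    return answer

print(aggregate(["Brady Bill", "brady bill", "Bill Clinton"]))  # -> "brady bill"
```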
4 Experiments
4.1 Experimental Setting
Document Archive and Test Set. As mentioned before, the NYT archive [31] is used as the underlying document collection and is indexed using Solr. The archive contains over 1.8 million articles published from January 1987 to June 2007 and is often used for temporal information retrieval research [15, 16].
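For concreteness, querying such a Solr index could look like the sketch below, using the pysolr client; the core name, field names, and query parser settings are assumptions, as the paper only states that the archive is indexed with Solr.

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/nyt", timeout=10)  # assumed core name

def fetch_candidates(question: str, k: int = 100):
    """Top-k candidate articles with their publication dates and scores."""
    results = solr.search(question, **{
        "defType": "edismax",          # handle the free-text question
        "qf": "headline^2 body",       # assumed field names
        "fl": "id,pub_date,body,score",
        "rows": k,
    })
    return [(doc["id"], doc["pub_date"], doc["body"], doc["score"]) for doc in results]
```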
Table 3. Performance of different models using EM and F1
Model | Top 1 EM | Top 1 F1 | Top 5 EM | Top 5 F1 | Top 10 EM | Top 10 F1 | Top 15 EM | Top 15 F1 |
---|---|---|---|---|---|---|---|---|
DrQA-NYT [9] | 22.50 | 27.58 | 28.00 | 32.78 | 29.50 | 34.11 | 32.00 | 36.87 |
DrQA-Wiki [9] | 21.00 | 26.17 | 22.50 | 27.92 | 26.00 | 31.49 | 29.00 | 34.37 |
QA-NLM-U [21] | 23.50 | 30.54 | 33.00 | 39.71 | 41.00 | 48.02 | 43.00 | 50.71 |
QA-Not-Rerank [30] | 25.50 | 32.45 | 30.00 | 37.84 | 40.50 | 47.32 | 42.00 | 48.95 |
QANA-TempPub | 26.00 | 33.69 | 36.00 | 42.75 | 39.50 | 47.19 | 44.00 | 50.71 |
QANA-TempCont | 22.50 | 29.70 | 32.50 | 40.67 | 41.50 | 49.05 | 44.50 | 51.09 |
QANA | 26.50 | 34.27 | 37.00 | 43.76 | 42.00 | 49.20 | 45.50 | 52.71 |
We compare the following models:
1. DrQA-NYT [9]: the DrQA system using the NYT archive.
2. DrQA-Wiki [9]: the DrQA system using Wikipedia as its sole knowledge source. We would like to test whether Wikipedia is sufficient for answering questions on events distant in the past.
3. QA-NLM-U [21]: a QA system that uses the best reranking method from [21], while the Document Retriever Module and Document Reader Module are the same as in QANA.
4. QA-Not-Rerank [30]: the QANA system without the reranking module, like other large scale question answering systems; the Document Retriever Module and Document Reader Module are the same as in QANA.
5. QANA-TempPub: the QANA version that uses only the timestamp-based temporal information for reranking in the Time-Aware Reranking Module.
6. QANA-TempCont: the QANA version that uses only the temporal information embedded in document content in the Time-Aware Reranking Module.
7. QANA: QANA with the complete Time-Aware Reranking Module.
4.2 Experimental Results
Table 4. Performance of the models when answering questions with few relevant documents vs. questions with many relevant documents
Question group | Model | Top 1 EM | Top 1 F1 | Top 5 EM | Top 5 F1 | Top 10 EM | Top 10 F1 | Top 15 EM | Top 15 F1 |
---|---|---|---|---|---|---|---|---|---|
Questions with few relevant documents | QA-Not-Rerank | 31.00 | 40.48 | 35.00 | 43.93 | 46.00 | 55.79 | 48.00 | 55.12 |
 | QANA | 31.00 | 40.52 | 45.00 | 54.18 | 48.00 | 57.28 | 52.00 | 59.22 |
Questions with many relevant documents | QA-Not-Rerank | 20.00 | 24.41 | 25.00 | 31.75 | 35.00 | 42.86 | 36.00 | 42.84 |
 | QANA | 22.00 | 28.02 | 29.00 | 33.33 | 36.00 | 41.11 | 39.00 | 46.21 |
We have also examined the performance of DrQA when using Wikipedia articles as its knowledge source. In this case, the results are worse than those of any other compared method that uses the NYT archive (including DrQA), which implies that Wikipedia cannot successfully answer questions on distant past events; such questions need to be answered using primary sources, i.e., news articles from the past.
When comparing with QA-NLM-U [21], the improvement ranges from 12.76% to 12.12% on the EM score, and from 12.21% to 10.19% on the F1 score. In addition, when comparing with QA-Not-Rerank [30], which does not include the reranking module, we also observe a clear improvement when considering the top 5 and top 15 documents, ranging from 23.33% to 8.33% on EM and from 15.64% to 7.11% on F1. Moreover, QANA-TempPub performs better than QANA-TempCont when using the top 1 and top 5 documents, but worse when using the top 10 and top 15. In addition, we can observe that using only timestamp information still allows achieving relatively good performance. Nevertheless, QANA with all the proposed components, which makes use of the inferred time scope of the questions and the temporal information from both document timestamps and document content, achieves the best results.
Table 5. Performance of the models when answering questions with few bursts vs. questions with many bursts
Question group | Model | Top 1 EM | Top 1 F1 | Top 5 EM | Top 5 F1 | Top 10 EM | Top 10 F1 | Top 15 EM | Top 15 F1 |
---|---|---|---|---|---|---|---|---|---|
Questions with few bursts | QA-Not-Rerank | 30.20 | 37.24 | 38.54 | 44.32 | 45.83 | 52.55 | 50.00 | 56.79 |
 | QANA | 30.20 | 38.11 | 42.70 | 48.55 | 46.87 | 54.98 | 52.08 | 58.96 |
Questions with many bursts | QA-Not-Rerank | 21.15 | 28.10 | 22.11 | 31.87 | 35.57 | 40.16 | 34.61 | 41.74 |
 | QANA | 23.07 | 30.72 | 31.73 | 39.33 | 37.50 | 43.86 | 39.42 | 46.95 |
Moreover, we also analyze the impact of the number of bursts on performance. About half of the questions (96 questions) have few bursts (less than or equal to 4). Table 5 shows that both QANA and QA-Not-Rerank perform much better when answering such questions. The events in questions with many bursts are likely to be similar to other events that occurred at different times, which makes it difficult to distinguish between the events. As our system considers the importance of bursts by assigning weights to them, it significantly outperforms QA-Not-Rerank. Although \(\alpha (Q)\) is smaller in this case (according to Eq. 9), it still plays an important part in selecting relevant documents. For example, if the number of bursts of a question is 10, \(\alpha (Q)\) is approximately 0.1, meaning that the temporal score drives about 10% of the final reranking.
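Equation 9 is not reproduced in this excerpt, but the example above (ten bursts giving \(\alpha (Q)\approx 0.1\)) is consistent with a reciprocal dependence on the number of bursts. The sketch below uses that reciprocal as a stand-in, and assumes a linear interpolation between the relevance and temporal scores; both are our assumptions rather than the paper's exact formula.

```python
def alpha(num_bursts: int) -> float:
    """Stand-in for Eq. 9: alpha shrinks as the number of bursts grows
    (about 0.1 for 10 bursts, matching the example in the text)."""
    return 1.0 / max(1, num_bursts)

def final_score(s_rel: float, s_temp: float, num_bursts: int) -> float:
    """Assumed linear interpolation between relevance and temporal scores."""
    a = alpha(num_bursts)
    return (1 - a) * s_rel + a * s_temp

# With 10 bursts the temporal score contributes roughly 10% of the mix:
# final_score(0.8, 0.5, 10) == 0.9 * 0.8 + 0.1 * 0.5 == 0.77
```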
QANA performance with different static alpha values vs. with dynamic alpha, for the top 5 reranked documents
5 Conclusions
In this work, we propose the new research task of answering event-related questions on long-term news archives and show an effective solution for it. Unlike in common QA systems designed for synchronic document collections, questions on long-term news archives are usually influenced by temporal aspects, resulting from the interplay between document timestamps, temporal information embedded in document content, and the query time scope. Therefore, exploiting temporal information is crucial for this type of QA, as demonstrated in our experiments. We are also the first to incorporate and adapt temporal information retrieval approaches into QA systems.
Finally, our work makes a few general observations. First, to answer event-related questions on long-span news archives one needs to (a) infer the time scope embedded within a question, and then (b) rerank documents based on their closeness and order relation to this time scope. Moreover, (c) using temporal expressions in documents further helps to select the best candidates. Lastly, (d) applying a dynamic way to determine the balance between query relevance and temporal relevance is quite helpful.
Footnotes
1.
2. We use GloVe [23] embeddings trained on the Common Crawl dataset with 300 dimensions.
3.
4. We use the list of temporal signals taken from [13].
5. The test set is available at https://www.dropbox.com/s/ygy7xy4k80wmcfl/TestQuestion.csv?dl=0.
6. We note that we have also tested QANA on 200 separate questions containing explicit temporal expressions, hence with time scopes directly given, and found that it outperforms the same baselines with even better results.
References
1. Korkeamäki, L., Kumpulainen, S.: Interacting with digital documents: a real life study of historians' task processes, actions and goals. In: Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, CHIIR '19, pp. 35–43. ACM, New York (2019). https://doi.org/10.1145/3295750.3298931
2. Bogaard, T., Hollink, L., Wielemaker, J., Hardman, L., Van Ossenbruggen, J.: Searching for old news: user interests and behavior within a national collection. In: Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, pp. 113–121. ACM (2019)
3. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
4. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)
5. Rajpurkar, P., Jia, R., Liang, P.: Know what you don't know: unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822 (2018)
6. Buck, C., et al.: Ask the right questions: active question reformulation with reinforcement learning. arXiv preprint arXiv:1705.07830 (2017)
7. Wang, S., et al.: R^3: reinforced ranker-reader for open-domain question answering. In: AAAI (2018)
8. Yang, W., et al.: End-to-end open-domain question answering with BERTserini. arXiv preprint arXiv:1902.01718 (2019)
9. Chen, D., Fisch, A., Weston, J., Bordes, A.: Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051 (2017)
10. Ni, J., Zhu, C., Chen, W., McAuley, J.: Learning to attend on essential terms: an enhanced retriever-reader model for scientific question answering. arXiv preprint arXiv:1808.09492 (2018)
11. Pasca, M.: Towards temporal web search. In: Proceedings of the 2008 ACM Symposium on Applied Computing, pp. 1117–1121. ACM (2008)
12. Harabagiu, S., Bejan, C.A.: Question answering based on temporal inference. In: Proceedings of the AAAI-2005 Workshop on Inference for Textual Question Answering, pp. 27–34 (2005)
13. Jia, Z., Abujabal, A., Saha Roy, R., Strötgen, J., Weikum, G.: TempQuestions: a benchmark for temporal question answering. In: Companion of the Web Conference 2018, pp. 1057–1062. International World Wide Web Conferences Steering Committee (2018)
14. Alonso, O., Gertz, M., Baeza-Yates, R.: On the value of temporal information in information retrieval. In: ACM SIGIR Forum, vol. 41, pp. 35–41. ACM (2007)
15. Campos, R., Dias, G., Jorge, A.M., Jatowt, A.: Survey of temporal information retrieval and related applications. ACM Comput. Surv. (CSUR) 47(2), 15 (2015)
16. Kanhabua, N., Blanco, R., Nørvåg, K.: Temporal information retrieval. Found. Trends Inf. Retrieval 9(2), 91–208 (2015). https://doi.org/10.1561/1500000043
17. Li, X., Croft, W.B.: Time-based language models. In: Proceedings of the Twelfth International Conference on Information and Knowledge Management, pp. 469–475. ACM (2003)
18. Metzler, D., Jones, R., Peng, F., Zhang, R.: Improving search relevance for implicitly temporal queries. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 700–701 (2009)
19. Arikan, I., Bedathur, S., Berberich, K.: Time will tell: leveraging temporal expressions in IR. In: WSDM (2009)
20. Berberich, K., Bedathur, S., Alonso, O., Weikum, G.: A language modeling approach for temporal information needs. In: Gurrin, C., et al. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 13–25. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12275-0_5
21. Kanhabua, N., Nørvåg, K.: Determining time of queries for re-ranking search results. In: Lalmas, M., Jose, J., Rauber, A., Sebastiani, F., Frommholz, I. (eds.) ECDL 2010. LNCS, vol. 6273, pp. 261–272. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15464-5_27
22. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
23. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
24. Grainger, T., Potter, T., Seeley, Y.: Solr in Action. Manning, Cherry Hill (2014)
25. Vlachos, M., Meek, C., Vagena, Z., Gunopulos, D.: Identifying similarities, periodicities and bursts for online search queries. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 131–142. ACM (2004)
26. Fung, G.P.C., Yu, J.X., Yu, P.S., Lu, H.: Parameter free bursty events detection in text streams. In: Proceedings of the 31st International Conference on Very Large Data Bases, pp. 181–192. VLDB Endowment (2005)
27. Snowsill, T., Nicart, F., Stefani, M., De Bie, T., Cristianini, N.: Finding surprising patterns in textual data streams. In: 2010 2nd International Workshop on Cognitive Information Processing, pp. 405–410. IEEE (2010)
28. Kleinberg, J.: Bursty and hierarchical structure in streams. Data Min. Knowl. Discov. 7(4), 373–397 (2003)
29. Chang, A.X., Manning, C.D.: SUTime: a library for recognizing and normalizing time expressions. In: LREC 2012, pp. 3735–3740 (2012)
30. Seo, M., Kembhavi, A., Farhadi, A., Hajishirzi, H.: Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603 (2016)
31. Sandhaus, E.: The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia 6(12), e26752 (2008)