Abstract
Knowing whether textual information remains valid is useful for various applications, including text comprehension, story understanding, temporal information retrieval, and user state tracking on microblogs as well as in chatbot conversations. This kind of inference is still difficult for current models, including large language models, as it requires temporal commonsense knowledge and reasoning. In this paper, we approach the task of Temporal Validity Reassessment, inspired by traditional natural language reasoning, to determine updates to the temporal validity of text content. The task requires judging whether actions expressed in a sentence are still ongoing or have been completed, and hence whether the sentence remains valid or has become obsolete, given context in the form of supplementary content such as a follow-up sentence. We first construct our own dataset for this task and train several machine learning models. Then we propose an effective method for learning information from an external knowledge base that provides temporal commonsense knowledge. Using our prepared dataset, we introduce a machine learning model that incorporates the information from the knowledge base and demonstrate that incorporating external knowledge generally improves the results. We also experiment with different embedding types to represent temporal commonsense knowledge as well as with data augmentation methods to increase the size of our dataset.
1 Introduction
Humans find it rather easy to reason about the validity of sentences. Consider a sentence written by a hypothetical user, for example as a microblog post: “Now waiting for a train to Munich, should be arriving here soon.” If the user then issues a subsequent post, “Again delay?, what’s wrong with DB trains these days?!”, we can guess that the person is very likely still waiting. That is, the action (i.e., waiting) stated in the former message is still ongoing, and thus the first sentence remains valid. On the other hand, if the subsequent post were “Finally! So, Goodbye Berlin!”, it would be highly likely that the first message (the one about waiting) is no longer valid in view of this additional evidence. As another example, suppose a person sees the following news title: “Polish PM visits White House”, and two days later another news title catches her eye: “Polish Prime Minister Donald Tusk speaks during a press conference after a government meeting in Warsaw”. The latter implicitly indicates the completion of the PM’s visit to the USA, i.e., the former sentence is no longer valid; however, this would not be the case if the press conference had taken place, for instance, in New York. Humans perform this kind of inference quite well thanks to commonsense and world knowledge, while for machines such sentences may still pose challenges.
Computers have demonstrated significant advantages in the field of natural language comprehension as a result of the emergence of pre-training models [1]. However, it still remains difficult for machines to perform effective reasoning that requires common sense [2]. As the amount of available text information has exploded in the modern digital era, it has become increasingly crucial for machines to acquire a much deeper understanding of natural language.
We investigate a novel temporal commonsense reasoning task called Temporal Validity Reassessment (TVR), which involves reasoning about stationarity and about the continuation, abortion, or completion of actions based on additional related content considered as evidence. As in the previous example, the temporal validity of the sentence “Now waiting for a train to Munich, should be arriving here soon.” needs to be judged in view of supplementary content, which in social media scenarios is assumed to be published after the first sentence (see Fig. 1). Note that this task involves a pair of sentences where one is a hypothesis and the other is a premise; hence, there is a natural resemblance to the Natural Language Inference (NLI) task (see Footnote 1) [3], and TVR could also be called Temporal Natural Language Inference, as in our prior publication [4] on which the current paper is based. While the concept of temporal validity estimation is general, that is, content of any size can be considered as input, for the sake of simplicity we take pairs of sentences as input, which is also in line with the input to the NLI task.
TVR could benefit diverse Natural Language Processing (NLP) tasks as well as Information Retrieval (IR), since better estimation of the obsoleteness and timeliness of search results could lead to more effective temporal information retrieval models [5]. Sec. 6 expands the discussion on the applications of approaches for reasoning about text temporal validity.
Similar to NLI, we address the TVR problem as a classification task using two input sentences: a hypothesis and a premise. Note that the order is the opposite of that in the NLI task, where the premise is usually given as the first sentence, followed by the hypothesis, and the relation between the two is to be judged. In our task, the sentence order is particularly important since it implies chronology. The first sentence is the hypothesis, whose validity is to be assessed; it is followed by the premise, which provides new information useful for reasoning about the temporal state of the hypothesis. We set up the following classification labels for the hypothesis sentence: Supported: the hypothesis is still valid in view of the premise; Invalidated: the hypothesis ceased to be valid based on the premise content; and Unknown: the premise does not provide any clue about the validity of the hypothesis sentence. Note that, in traditional NLI settings, the labels express whether the premise entails or contradicts the hypothesis. In TVR, on the other hand, the labels express whether the premise supports (action still ongoing) or invalidates (action aborted or completed) the hypothesis, or whether the premise does not provide useful information about the temporal state of the hypothesis. As mentioned before, while there are significant similarities between TVR and NLI in the input format and in the number and role of classes, the temporal and continuity-related character of the classes in TVR makes this task relatively distinct. In the later part of the paper, we explore the possibility of using NLI datasets for pretraining machine learning models for the TVR task.
Considering our earlier example, we regard “Now waiting for a train to Munich, should be arriving here soon.” as a hypothesis since its validity is unknown, and “Again delay?, what’s wrong with DB trains these days?!” or “Finally! So, Goodbye Berlin!” as a premise (see Footnote 2). Almquist and Jatowt [6] focused on another variant of the temporal validity estimation problem (also called information expiry date estimation), in which a single sentence (hypothesis only) formed the input and the task was to decide the minimum length of the time period (usually represented as a few validity duration classes) during which the hypothesis should remain valid. For example, for the sentence “Now waiting for a train to Munich, should be arriving here soon.” we would assume a validity of a few minutes or a few hours, but not of a few days or weeks. Note that the two tasks can be combined so that both the elapsed time and the additional context are used for judging whether the target sentence remains valid.
1.1 Contributions
Besides introducing a novel task, we also describe a knowledge-enhanced method and a dedicated dataset for the proposed task. Our approach incorporates information from a large knowledge graph (ATOMIC-2020 [7]) that holds commonsense knowledge, since we believe that an effective approach requires large-scale and deep knowledge about the world.
The proposed model combines the following two encoders: the first one represents commonsense knowledge from ATOMIC-2020 via pre-training and the second one encodes commonsense knowledge from the training dataset we use. Both encoders jointly reason with the commonsense knowledge in each input pair of sentences. The dataset that we construct and provide contains over 10k pairs of sentences describing concrete actions, with labels denoting the temporal validity of the hypotheses. We have released both the dataset and the associated code for our approach (see Footnote 3).
The remainder of this paper is organized as follows. In Section 2, we survey the related work; in Section 3, we formally describe the task and introduce our approach to solving it. We explain how we constructed the dataset in Section 4, and we describe the experimental settings and discuss the experimental results in Section 5. Section 6 provides an overview of the applications of our proposed task, while the last section concludes the paper.
2 Related work
We discuss the related work starting with background on temporal commonsense reasoning and natural language inference. We then compare our task with related ones and give an overview of similar datasets.
2.1 Temporal commonsense reasoning
Several tasks in Natural Language Processing and Information Retrieval domains consider time as key aspect of text [8,9,10,11,12,13,14,15,16], including temporal understanding of stories [17, 18], temporal relation extraction [19, 20], temporal question answering [21, 22], and so on. Many of such works make use of temporal expressions embedded in text or of document timestamps (i.e., publication dates) [5, 23].
Fig. 1 A schematic comparison of the TVR task (right) with Temporal Validity Duration Prediction (TVDP) introduced in [6] (left) for the case of microblog posts. Post A denotes a hypothesis sentence while post B is a premise
Implicit information that humans commonly know is addressed in the domain known as Commonsense Reasoning [24]. The Winograd Schema Challenge [25] was one of the earliest challenges for machines in this regard, and many other challenges and approaches have also been proposed [26,27,28,29,30,31,32]. Temporal Commonsense is one of them, in which temporal challenges are addressed [33]. Zhou et al. [34] focused on comparing actions such as “going on a vacation” with others like “going for a walk” to assess which take longer, and constructed a question-answering dataset that includes this kind of estimation. In particular, the subset of that dataset that relates to stationarity is relevant to our work. We further compare our task with other related ones in Sect. 2.4.
White and Awadallah [35] estimated the duration of tasks assigned by users in calendars. Takemura and Tajima [36] classified microblog posts into different lifetime classes based on features specific to Twitter, such as the number of followers or the presence of URLs. Almquist and Jatowt [6] examined the validity of sentences considering the time elapsed since their creation (more in Sect. 2.4). They introduced a novel task of estimating the validity period of a sentence and constructed a dedicated dataset composed of sentences from Wikipedia, news and social media. The most recent extension of temporal validity research involves determining the direction of change in temporal validity (increase, decrease or neutral) [37].
To probe models’ common sense with regard to temporal relations and reasoning, the TimeDial [38] and MC-TACO [34] datasets were established, embracing a diverse array of situations and types of temporal information. We give an overview of common datasets related to temporal commonsense reasoning in Section 2.5. More information about datasets is also provided in a recent survey [39]. Jain et al. [40] analyzed the performance of diverse large language models (LLMs) across different temporal commonsense reasoning tasks, finding that such reasoning still poses significant challenges for LLMs.
Additionally, various developments in time-aware training and representation strategies for language models have also been proposed recently [41,42,43,44] including temporal reasoning approaches based on knowledge graphs [45, 46]. Overall, contemporary research has exhibited a notable expansion in temporal reasoning studies in natural language understanding.
2.2 Natural Language Inference
Recently, Natural Language Understanding (NLU) by computers has attracted a lot of attention from researchers. Natural Language Inference (NLI), or Recognizing Textual Entailment, is one of the NLU domains in which computers deal with input in the form of two sentences [3], similar to our proposed task. NLI problems require determining whether a premise sentence entails, contradicts, or is neutral to a hypothesis sentence (or, in some settings, entails vs. does not entail). In the early stages of NLI research, Dagan et al. [47] constructed a relatively small dataset. The first large annotated dataset was the Stanford Natural Language Inference (SNLI) dataset [48], which was annotated through crowdsourcing. After that, many NLI datasets [49, 50], including Multi-genre Natural Language Inference (MNLI) [51] and Scitail [52], have been constructed. Vashishtha et al. [53] also converted existing datasets for temporal reasoning into NLI format, pointing out that there was no NLI dataset dedicated to temporal reasoning. Their task focuses on explicit temporal descriptions while our task tackles implicit information. The emergence of many large datasets made it possible to train more complex models [3, 54]. Notably, pre-trained models such as BERT [55] and RoBERTa [56] demonstrated strong performance on NLI datasets, and were also used to train multi-task models [57].
2.3 Incorporation of knowledge bases
Generally, NLU works make use of Knowledge Graphs (KG) or Knowledge Bases (KB) to improve model performance [58,59,60]. Especially, commonsense reasoning works commonly incorporate knowledge from large KBs such as ConceptNet [61, 62] and WikiData [63] in their architectures [64,65,66]. However, only a few works in NLI attempt to incorporate KGs into computational models [67, 68]. Wang et al. [69], for example, improve performance on Scitail using knowledge borrowed from ConceptNet.
2.4 Comparison with related tasks
Similar to NLI, our work addresses a text classification problem, in which two sentences form an input. However, we focus on neither entailment nor contradiction but on the validity of sentences (see Tables 1 and 2 for comparison).
The NLI dataset constructed by Vashishtha et al. [53] includes temporal phenomena. However, their task addresses explicit descriptions of temporal relations such as duration and order, while we focus on implicit temporal information that is latent in sentences, similar to [34]. The problem that the task deals with is reasoning about event duration, ordering, and frequency separately, whereas our approach requires a more comprehensive understanding of temporal phenomena through contrastive inference. Also, their task is posed as a question-answering problem while ours is formalized as an NLI-type problem. Almquist and Jatowt [6] also worked on the validity of sentences. Unlike their work, we use premises as the additional source instead of the time elapsed since sentence creation as in [6], since in many practical situations additional text is available (e.g., sequences of tweets posted by the same user, or subsequent sentences in a story or novel). Another recently proposed task, called Temporal Validity Change Prediction (TVCP) [37], is similar to TVR; however, it requires inferring whether the temporal validity has increased, decreased, or remained at the same level given additional context. Figure 1 shows the comparison of our task with the one proposed in [6]. Table 4 also compares the TVR task with the most related ones.
2.5 Similar datasets
There are several temporal commonsense reasoning datasets available. In the following, we list the main ones and summarize them in Table 3:
MC-TACO [34]: Given a context, a question, and a candidate response, the objective is to determine whether the candidate answer is “yes” (plausible) or “no” (implausible). The dataset focuses on assessing the plausibility of the answer within the temporal context provided.
TimeDial [38]: Dataset of a multiple-choice cloze task featuring over 1.1K carefully curated dialogues. The dialogues require an understanding of temporal commonsense concepts interwoven with the presented events.
WikiHow [70]: Given a goal and a number of steps, a system has to determine if the steps are in the correct temporal order.
BIG-bench [71]: Provided with a sequence of finished events, each with its defined timeframe, the model needs to determine when an individual might have been available for an unscheduled activity. While both BIG-bench and WikiHow encompass various other reasoning tasks, we specifically focus in this work only on temporal reasoning.
TimeQA [72]: This dataset comprises a series of time-sensitive question-answer pairs. Answering these questions involves understanding and reasoning within a longer context that requires temporal comprehension.
Note that for the case of TVR, the model has to ascertain the validity of textual content by using additional associated content as corroborating evidence. Our dataset is then complementary to the above-listed ones.
We also note that the above-listed datasets actually cover most of the temporal commonsense reasoning styles according to the categorization proposed by Zhou et al. [34]:
Event duration (ED): reasoning about event durations.
Event ordering (EO): reasoning about the typical sequence of events.
Frequency (F): reasoning about the frequency of event occurrences.
Stationarity (S): reasoning about the length of state persistence.
Typical time (TT): reasoning about the specific timing of events.
Besides summarizing the datasets, Table 3 also lists the types of temporal commonsense reasoning they involve (cf. the last column), the characteristics of their tasks, the format of the output, and the evaluation metrics applied. The TVR task requires combining information related to the Stationarity, Event Duration, and Event Ordering types. It also directly considers the notion of changes in information validity and obsoleteness.
3 Proposed method
3.1 Task definition
We first define our task. Let \(p = (s_1, s_2)\) be a pair of sentences where \(s_1\) and \(s_2\) are a hypothesis and a premise sentence, respectively. The sentences are in temporal order \(t_{s_1} \le t_{s_2}\), where \(t_{s_{id}}\) \((id=1,2)\) is the creation time or the reading time of sentence \(s_{id}\) (e.g., when receiving microblog posts issued by a user, or when reading subsequent sentences of a story or a novel). The task is to assign one of three validity classes (Supported, Invalidated, or Unknown) to \(s_1\) through inference on \(s_1\) based on the content of \(s_2\).
The semantics of the classes are as follows:
- Supported: \(s_1\) remains valid at \(t_{s_2}\) given the information in \(s_2\).
- Invalidated: \(s_1\) ceased to be valid at \(t_{s_2}\) in view of \(s_2\).
- Unknown: the situational evidence is not conclusive or clear, and nothing can be said regarding the validity of the hypothesis (hence it can be neither supported nor invalidated).
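The task definition above can be mirrored in a small data structure. The following is a minimal sketch, not the released code: the names `Validity` and `TVRInstance` are hypothetical, and the two labeled pairs are taken from the train example discussed in the introduction.

```python
from dataclasses import dataclass
from enum import Enum


class Validity(Enum):
    """The three validity classes assigned to the hypothesis s_1."""
    SUPPORTED = "supported"      # s_1 is still valid at t_{s_2}
    INVALIDATED = "invalidated"  # s_1 ceased to be valid at t_{s_2}
    UNKNOWN = "unknown"          # s_2 gives no conclusive evidence


@dataclass(frozen=True)
class TVRInstance:
    """One pair p = (s_1, s_2) with its label (hypothetical container)."""
    hypothesis: str  # s_1, created or read first
    premise: str     # s_2, created or read at or after t_{s_1}
    label: Validity

    def as_model_input(self):
        # The order matters: the hypothesis comes first because it implies chronology.
        return (self.hypothesis, self.premise)


HYPOTHESIS = "Now waiting for a train to Munich, should be arriving here soon."
examples = [
    TVRInstance(HYPOTHESIS,
                "Again delay?, what's wrong with DB trains these days?!",
                Validity.SUPPORTED),
    TVRInstance(HYPOTHESIS,
                "Finally! So, Goodbye Berlin!",
                Validity.INVALIDATED),
]

for ex in examples:
    print(ex.label.value, "<-", ex.as_model_input()[1])
```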
As mentioned earlier, while we focus on the case of individual sentences, larger text chunks such as paragraphs could be considered instead. This, however, would pose more complexities since multiple actions with various inter-relations might be expressed in longer text portions.
We note that the assumption that \(s_1\) is followed by \(s_2\), whether the order is given by creation/posting dates (e.g., in the case of Twitter) or by the sentence order in text (e.g., a story or a novel), may not always hold in reality. Although such an order is the most natural and common, there might be cases of retrospection in narratives, or of messages created or sent in swapped order. We leave the investigation of such cases for future work.
Finally, we note that, for simplicity, we do not consider in this paper hypothesis sentences expressing future or past actions but only ones that describe ongoing actions. The case of sentences about the past is rather trivial (e.g., "WWII started in 1939 with the attack on Poland") since they are in general valid. On the other hand, the case of sentences about future such as ones expressing plans, forecasts and expectations is somewhat difficult to evaluate, hence, in the current experiments, we only focus on present actions and states.
3.2 Methodology
Given the requirement for temporal commonsense reasoning when judging the validity, we believe that it makes sense to incorporate external knowledge about common human actions and their temporal aspects. Consequently, we first discuss relevant knowledge bases that could provide information about temporal properties of common user actions. Then, we propose a new neural network-based architecture that combines two encoders. The first encoder utilizes the information drawn from the knowledge base while the second encoder, the text encoder, is based only on the text data. The combined output of the encoder utilizing the knowledge base and the text encoder is then used as input to the softmax classifier. Figure 2 illustrates the architecture of our model.
3.2.1 Encoding knowledge
As mentioned before, one of the components of our model is the knowledge encoder. It is however necessary to first select a suitable knowledge base for making it effective for our task. We then describe our choice of a knowledge base and discuss how we encode its knowledge.
Several different knowledge bases (KBs) could be useful to achieve our goal. We have considered the following ones: FrameNet [75], WikiHow [76], Howto100m [77], and VerbNet [78]. We, however, concluded that ATOMIC-2020 (An ATlas Of MachIne Commonsense) [7] would be the most suitable KB for our purpose thanks to its relatively large scale (1.33M commonsense knowledge tuples and 23 commonsense relations) as well as because it contains temporal commonsense relations.
ATOMIC [79], the predecessor of ATOMIC-2020, is a KB designed for commonsense reasoning that contains nine different if-then relations such as Cause, Effect, Intention, Reaction, and so on. Most of the entities in ATOMIC are expressed as short texts or phrases. ATOMIC-2020 is the subsequent version of ATOMIC, which incorporates new relations between events such as “IsAfter”, “IsBefore”, “HasSubevent”, etc. For example, “PersonX pays PersonY a compliment” and “PersonX will want to chat with PersonY” are sentences connected by an if-then relation in ATOMIC-2020, while “PersonX bakes bread” and “PersonX needed to buy ingredients” are an example of a pair of sentences connected by the “IsAfter” relation.
3.3 TransE
We adapt TransE [80] model to represent commonsense relations between events. In the following, we briefly explain the idea behind this adaptation starting with the explanation of TransE itself. TransE is a model for learning embeddings of KBs represented in the triple form involving entities and their relations which are represented as follows: [head entity, relation, tail entity].
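In TransE, relations are modeled as translations in the vector space. TransE learns embeddings with a margin-based ranking loss defined over entities and their relations (similar to skip-gram [81]) so that the following is approximately preserved: head entity + relation = tail entity. The loss is:
\[ \mathcal{L} = \sum_{(h,l,t) \in S} \; \sum_{(h',l,t') \in S'} \left[ \gamma + d(\textbf{h} + \textbf{l}, \textbf{t}) - d(\textbf{h}' + \textbf{l}, \textbf{t}') \right]_+ \]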
where \([x]_+\) denotes the positive part of x, \(\gamma \) is a margin parameter, and d is the distance function. \(\textbf{h}\), \(\textbf{l}\), and \(\textbf{t}\) are the embeddings of head entity, relation label, and tail entity, respectively. S is a set of positive data instances, while \(S'\) is the set of negative ones.
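The margin-based ranking objective of TransE, written with the symbols defined above (\(\gamma\), d, \(\textbf{h}\), \(\textbf{l}\), \(\textbf{t}\), S, \(S'\)), can be sketched in a few lines. This is a minimal illustration in plain Python, assuming that d is the Euclidean distance and that each positive triple is paired with exactly one negative triple; the function names and the margin value are illustrative assumptions.

```python
import math


def l2_distance(u, v):
    """Euclidean distance d(u, v); the choice of L2 is an assumption."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def translate(h, l):
    """Translation h + l in the embedding space."""
    return [a + b for a, b in zip(h, l)]


def margin_ranking_loss(positives, negatives, gamma=1.0):
    """Sum of [gamma + d(h + l, t) - d(h' + l, t')]_+ over paired triples.

    positives and negatives are equally long lists of (h, l, t) embedding
    triples; negatives[i] is the corrupted version of positives[i].
    """
    total = 0.0
    for (h, l, t), (h_neg, l_neg, t_neg) in zip(positives, negatives):
        pos = l2_distance(translate(h, l), t)
        neg = l2_distance(translate(h_neg, l_neg), t_neg)
        total += max(0.0, gamma + pos - neg)  # [x]_+ is the positive part
    return total


h, l, t = [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]   # h + l equals t exactly
far_tail = [4.0, 4.0]                          # clearly wrong tail
near_tail = [1.0, 0.5]                         # wrong tail inside the margin

print(margin_ranking_loss([(h, l, t)], [(h, l, far_tail)]))   # no penalty
print(margin_ranking_loss([(h, l, t)], [(h, l, near_tail)]))  # penalized
```

The loss is zero once every negative triple is at least \(\gamma\) farther from the translated head than its positive counterpart, which is why the far tail incurs no penalty while the near tail does.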
3.4 Adapting TransE for ATOMIC-2020 sentences
As mentioned before, the entities in the ATOMIC-2020 dataset are represented as short phrases or sentences. In the following, we show two examples of relations and their head and tail entities (see Footnote 4):
- x get x’s car repaired, happens before, person spent a fortune
- x runs out of steam, is after, x exercises in the gym
The original TransE method assumes that the same entities occur multiple times in the knowledge base. In ATOMIC-2020, on the other hand, the number of potential entities is quite large, as many diverse phrases can be used to represent arbitrary human actions. Also, a phrase seen at inference (or testing) time rarely matches one used for training.
To solve this problem, we adapt the TransE model to text chunks instead of just entities as originally used. Figure 3 shows the detailed structure of our model. First, we compute a sentence vector corresponding to each phrase in the KG using Sentence-BERT (SBERT) [82]. Then, we train the weights W for the sentence vectors and the relation embedding \(E_r\) using Margin Based Ranking Loss as in TransE. The weights of SBERT are fixed and not trained. Since our task relies on temporal commonsense reasoning, we select only “IsAfter” and “IsBefore” relations from ATOMIC-2020 for our model.
After the pre-training with TransE is completed, we construct an encoder for the downstream task using the embeddings of the TransE model that was pre-trained on the ATOMIC-2020 knowledge base. The output of the encoder is the concatenation of the embedding of the hypothesis sentence and that of the premise sentence.
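The adaptation described above can be sketched as follows. This is only an illustration under stated assumptions, not the released implementation: random vectors stand in for frozen SBERT sentence embeddings, a single projection W is assumed to be shared by head and tail phrases, the distance is assumed to be squared L2, and the dimensions are toy values. Training of W and \(E_r\) with the margin-based ranking loss is omitted.

```python
import random


def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def triple_distance(W, e_r, v_head, v_tail):
    """Squared L2 distance between (W v_head + E_r) and (W v_tail)."""
    head = matvec(W, v_head)
    tail = matvec(W, v_tail)
    return sum((h + r - t) ** 2 for h, r, t in zip(head, e_r, tail))


def encode_pair(W, v_hypothesis, v_premise):
    """Knowledge-encoder output: concatenation of the two projected vectors."""
    return matvec(W, v_hypothesis) + matvec(W, v_premise)


rng = random.Random(0)
SENT_DIM, KG_DIM = 8, 4  # toy sizes (assumption); real SBERT vectors are larger


def fake_sentence_vector():
    """Stand-in for a frozen SBERT embedding of a phrase (assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(SENT_DIM)]


# Trainable parameters: projection weights W and one embedding per relation.
W = [[rng.uniform(-0.1, 0.1) for _ in range(SENT_DIM)] for _ in range(KG_DIM)]
E_r = {name: [rng.uniform(-0.1, 0.1) for _ in range(KG_DIM)]
       for name in ("IsAfter", "IsBefore")}

v_head = fake_sentence_vector()
v_tail = fake_sentence_vector()
print(triple_distance(W, E_r["IsAfter"], v_head, v_tail))
print(len(encode_pair(W, v_head, v_tail)))  # twice KG_DIM
```

Because the sentence encoder is frozen, unseen phrases at inference time still receive an embedding, which is the property that plain entity-lookup TransE lacks.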
3.5 Other knowledge embedding models
While TransE may probably be the most often used translation model, several newer variants are also available. In our experiments, we will then also test other variants of translating embedding models like TransH [83] and ComplEx [84] in place of TransE. Both of the models are actually the extensions of TransE.
TransH extends TransE by applying the translation from head to tail entity in a relation-specific hyperplane. This is done to address the inability of TransE to model one-to-many, many-to-one, and many-to-many relations. When applying TransH, we use the same knowledge embedding model as for TransE, except that we add a module for projection. In order to project each relation onto the hyperplane, we use a relation-specific projection matrix, as is done in the original TransH model.
The ComplEx model is based on complex-valued entity and relation representations. When applying the ComplEx model, we add a linear layer after the sentence embedding so that the model has two parallel linear layers to transform sentence embeddings, where one produces the real part and the other the imaginary part.
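The two scoring functions can be sketched as below. This follows the commonly cited formulations, in which TransH projects entities onto a hyperplane given by a unit normal vector \(w_r\) and ComplEx scores a triple by the real part of a trilinear product with the conjugated tail; how exactly the projection and the real and imaginary layers are parameterized in the model described here is not specified, so the function names and signatures are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_to_hyperplane(v, w_r):
    """Project v onto the hyperplane with unit normal w_r: v - (w_r . v) w_r."""
    scale = dot(w_r, v)
    return [a - scale * b for a, b in zip(v, w_r)]


def transh_distance(h, t, d_r, w_r):
    """Squared L2 distance between projected head plus d_r and projected tail."""
    h_p = project_to_hyperplane(h, w_r)
    t_p = project_to_hyperplane(t, w_r)
    return sum((a + r - b) ** 2 for a, r, b in zip(h_p, d_r, t_p))


def complex_score(h_re, h_im, r_re, r_im, t_re, t_im):
    """ComplEx score Re(<h, r, conj(t)>); higher means more plausible.

    The real and imaginary parts would come from the two parallel linear
    layers applied to the sentence embedding.
    """
    total = 0.0
    for a, b, c, d, e, f in zip(h_re, h_im, r_re, r_im, t_re, t_im):
        total += (complex(a, b) * complex(c, d) * complex(e, -f)).real
    return total


# Head and tail differ only along the normal direction and by d_r,
# so the TransH distance vanishes after projection.
print(transh_distance(h=[5.0, 1.0], t=[-3.0, 2.0], d_r=[0.0, 1.0], w_r=[1.0, 0.0]))
print(complex_score([1.0], [0.0], [0.0], [1.0], [0.0], [1.0]))
```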
3.5.1 Combined model
As shown in Fig. 2, our final model consists of the combination of text encoder and knowledge encoder, together with the classification layer on top of them. As the dimensions of the pre-trained knowledge embeddings and the output of the text encoder differ, we linearly transform them to make the sizes of the embeddings equal. We then apply the concatenation, calculation, and element-wise product for combining both embedding vectors:
where \(\mathbf {H_t}\) is the output of the text encoder, \(\mathbf {H_k}\) is the output of knowledge encoder, and \(\odot \) denotes the operation of element-wise multiplication. Finally, the obtained output is linearly transformed, and fed into a softmax classifier which is tasked with deciding the validity class.
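A minimal sketch of this combination step follows. It assumes that the "calculation" mentioned above is an element-wise difference, as in common sentence-pair matching layers; this is an assumption, since the exact operation is not recoverable from the text. The final linear transformation is omitted, and the softmax is applied to hypothetical logits.

```python
import math


def combine(h_t, h_k):
    """Concatenate [H_t; H_k; H_t - H_k; H_t * H_k] (difference is assumed)."""
    diff = [a - b for a, b in zip(h_t, h_k)]
    prod = [a * b for a, b in zip(h_t, h_k)]  # element-wise product
    return h_t + h_k + diff + prod


def softmax(logits):
    """Numerically stable softmax over the three validity classes."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# H_t and H_k after the linear layers that equalize their dimensions (toy values).
h_t = [1.0, 2.0]
h_k = [3.0, 4.0]
features = combine(h_t, h_k)
print(features)

# Hypothetical logits produced by a linear layer on top of the features.
print(softmax([2.0, 0.5, -1.0]))
```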
3.5.2 Knowledge-encoder only
For comparison, in our experiments we also test a knowledge-encoder-only version of the model. In this variant, we exclude the text encoder and the comparison layer from the combined version. In this case, the linear layer that is used for dimension adjustment is not necessary. However, we still keep this layer for the purpose of comparison.
4 Dataset
We construct a new dataset in order to evaluate our method designed for the proposed task. As mentioned earlier, each data instance is composed of a pair of hypothesis and premise sentences together with a label denoting the validity of the hypothesis. This formulation bears strong resemblance to the Natural Language Inference problem in NLP, also known as Textual Entailment detection.
To begin with, we need seed sentences for which we can create corresponding sentences that fall into each of the three validity classes. We decided to randomly select 5000 premise sentences from the SNLI dataset (see Footnote 5) and use them as the hypothesis sentences. The sentences in SNLI were originally the captions of images that are part of the Flickr30k corpus [85].
Since we wanted a topically balanced dataset, we first clustered all the collected sentences using BERT [55] for sentence representation and k-means (with \(k = 100\)) as the clustering algorithm. Then, we sampled an equal number of sentences from each cluster (in particular, we randomly extracted up to 50 sentences per cluster) to maintain high variation in sentence meaning.
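The per-cluster sampling step can be sketched as follows. The clustering itself (BERT representations followed by k-means) is not reproduced; the sketch assumes that cluster assignments are already available, and the function name, seed, and toy data are illustrative.

```python
import random
from collections import defaultdict


def sample_per_cluster(sentences, cluster_ids, per_cluster=50, seed=0):
    """Randomly draw up to per_cluster sentences from every cluster."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for sentence, cluster in zip(sentences, cluster_ids):
        clusters[cluster].append(sentence)

    sampled = []
    for cluster in sorted(clusters):
        members = clusters[cluster]
        # "Up to": small clusters contribute all of their members.
        sampled.extend(rng.sample(members, min(per_cluster, len(members))))
    return sampled


# Toy data: 7 sentences in cluster 0 and 3 in cluster 1, capped at 5 per cluster.
toy_sentences = [f"sentence {i}" for i in range(10)]
toy_clusters = [0] * 7 + [1] * 3
picked = sample_per_cluster(toy_sentences, toy_clusters, per_cluster=5)
print(len(picked))
```

With \(k = 100\) clusters and a cap of 50 sentences per cluster, this procedure yields at most 5000 sentences, which agrees with the number of seed hypotheses reported above.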
Having the hypothesis sentences, we next created their premise sentences as well as class labels using crowdsourcing. We used Amazon Mechanical Turk (see Footnote 6) as the crowdsourcing platform. In particular, for each hypothesis, we asked two crowdworkers to create a sentence corresponding to each class label. In order to prevent copying, the candidate premise sentences were accepted only when 40% or more of their words were not in the corresponding hypothesis sentence. This was done by tokenization and the calculation of word overlap between hypotheses and premises. The similarity checking step was necessary as some crowdworkers tried to create sentences with minimal effort (e.g., by modifying a single word or a few words). Such trivial sentences were especially likely to be written for the supported class.
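The copy-prevention check can be sketched as below. The tokenizer (lowercased alphanumeric tokens) and the choice to compute the ratio over premise tokens are assumptions, since the exact tokenization is not specified; the example sentences are invented for illustration and are not taken from the dataset.

```python
import re


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the original one is unspecified)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def novel_word_ratio(hypothesis, premise):
    """Fraction of premise tokens that do not occur in the hypothesis."""
    hypothesis_words = set(tokenize(hypothesis))
    premise_tokens = tokenize(premise)
    if not premise_tokens:
        return 0.0
    novel = sum(1 for word in premise_tokens if word not in hypothesis_words)
    return novel / len(premise_tokens)


def is_acceptable(hypothesis, premise, threshold=0.4):
    """Accept a candidate premise only if at least 40% of its words are new."""
    return novel_word_ratio(hypothesis, premise) >= threshold


hypothesis = "A man is playing a guitar on stage."
near_copy = "A man is playing a guitar on the stage."
rewritten = "The concert just ended and the crowd is leaving."

print(is_acceptable(hypothesis, near_copy))   # near-copy is rejected
print(is_acceptable(hypothesis, rewritten))   # substantially new text is accepted
```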
When creating premise sentences, participants were also asked to describe the estimated time during which the hypothesis sentences could have been valid. The reason for this was to make sure that the workers carefully considered the temporality of text when creating content. This information was, however, not included in the final dataset. In total, about 400 workers participated in the dataset creation. As we found some noisy sentences, we later went through all the sentences in the dataset to verify their quality. Whenever necessary, grammar was manually corrected, while poor-quality, noisy, offensive, or overly personal sentences were removed.
In total, we removed 19,341 pairs of sentences. During verification we checked, among other things, whether the written sentence contains anything offensive or personal, whether it copies the corresponding hypothesis, whether it matches the corresponding hypothesis, and whether it contains grammatical or spelling errors. The final dataset includes 10,659 sentence pairs. The number of sentence pairs is the same for each class (3553 pairs). Table 1 shows a few examples of the generated data. For comparison, we also present example sentences of the NLI task in Table 2.
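The reported counts are mutually consistent under one assumption that is not stated explicitly, namely that every one of the 5000 hypotheses received one premise per class from each of its two crowdworkers:

\[ 5000 \times 2 \times 3 = 30{,}000, \qquad 30{,}000 - 19{,}341 = 10{,}659 = 3 \times 3553. \]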
In Table 5, we show the average number of words across different classes. Previous research has indicated that the number of words in sentences in some NLI datasets varies significantly depending on their labels [52], which may adversely affect classifiers. As can be seen, the average number of words in our dataset is almost the same for different labels, although the variance is a bit high for the unknown class. It is also interesting that premise sentences are on average shorter than the hypothesis sentences.
5 Experiments
5.1 Experimental settings
In our experiments, all compared models undergo 5-fold non-nested cross-validation. The batch size is 16, and the learning rate is selected from among 0.005, 0.0005, and 0.00005 based on the performance on the validation fold. The optimal value for all models was 0.00005. Accuracy, the proportion of correct responses, is the most relevant and most widely used metric for NLI tasks, and we use it to evaluate our approach, too.
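The evaluation protocol can be sketched roughly as below. The fold construction, the fixed seed, and the `train_and_score` callback (a stand-in for training a model with a given learning rate and returning validation accuracy) are hypothetical and only illustrate non-nested selection over the three candidate learning rates.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, validation


def accuracy(predictions, gold):
    """Proportion of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def select_learning_rate(candidates, n_items, train_and_score, k=5):
    """Pick the candidate with the best mean validation accuracy (non-nested)."""
    mean_scores = {}
    for lr in candidates:
        scores = [train_and_score(lr, train, val)
                  for train, val in k_fold_indices(n_items, k)]
        mean_scores[lr] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores
```

Because the same folds are used both to choose the learning rate and to report accuracy, the procedure is non-nested.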
Besides ours, we also test the following models:
-
BERT (bert-base-uncased) [55],
-
Siamese Network [86],
-
SBERT [82] Embeddings with Feedforward Network,
-
Self-Explaining Model [87].
Additionally, we include the results of GPT 3.5 and Llama 7B [88] in few-shot settings (three examples) as originally reported in [40].
We set up the Siamese Network with an architecture similar to the one used by Bowman et al. [48], which utilizes the 8B version of GloVe embeddings [89] and multiple tanh layers. The number of trainable parameters is 240,181,403.
SBERT is a fine-tuned variant of BERT, RoBERTa, and related models designed to calculate the semantic similarity between two sentences. While BERT has demonstrated state-of-the-art performance on many downstream tasks, it was unclear how well BERT can encode sentences. In addition, calculating similarity as a two-input regression task is time-consuming. SBERT was proposed to solve these problems. In our experiments, we use 3 hidden layers, each with 500 dimensions, ReLU activation, and a dropout rate of 75%. The output layer is based on a softmax classifier. The number of trainable parameters for this model is 1,271,003.
The last tested approach, the Self-Explaining model [87], is a state-of-the-art model for the SNLI dataset and has 127,008,769 trainable parameters. The Self-Explaining model is equipped with an attention-like Self-Explaining layer composed of three layers: the Span Infor Collecting (SIC) layer, the Interpretation layer, and an output layer. The Self-Explaining layer is placed on top of a text encoder, which is RoBERTa-base [56], the same as in the original version of the model. The Robustly optimized BERT approach (RoBERTa) is a BERT-based model that uses modified pre-training data and optimized hyperparameters. RoBERTa is known to achieve a significant improvement in performance over BERT [90] on downstream tasks.
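As a sanity check on the reported size of the feedforward classifier, the parameter count can be reproduced with simple arithmetic. The sketch below assumes that the input is the concatenation of two 768-dimensional sentence embeddings (premise and hypothesis) and that every layer has a bias term; neither detail is stated explicitly above.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes):
    """Total trainable parameters of a feedforward network."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


EMBEDDING_DIM = 768  # assumption: size of one sentence embedding
NUM_CLASSES = 3      # supported, invalidated, unknown

# Assumed layout: [premise; hypothesis] -> 500 -> 500 -> 500 -> 3
layer_sizes = [2 * EMBEDDING_DIM, 500, 500, 500, NUM_CLASSES]
total = mlp_params(layer_sizes)
print(total)
```

Under these assumptions the total agrees with the 1,271,003 trainable parameters reported for this model.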
The SIC layer outputs an embedding for each word span. Suppose that a sentence consists of words \(w_1,..., w_K\). A word span is a sequence of words \(w_i, w_{i+1},..., w_j\) over the interval \([i, j]\) with \(1 \le i < j \le K\). We denote the embedding of this span by \(\mathbf{h_{ij}}\). The encoder outputs the word embeddings \(\mathbf{e_1},..., \mathbf{e_K}\), and using these as input, the span embedding is \(\mathbf{h_{ij}}=F(\mathbf{e_i}, \mathbf{e_j})\), where F is an arbitrary function. The final output of the SIC layer is \(\textbf{H} = (\mathbf{h_{12}}; \mathbf{h_{13}};...; \mathbf{h_{K-1,K}})\), where “;” denotes concatenation. The Interpretation layer aggregates the output of the SIC layer by first calculating an importance weight \(\alpha(i, j)\) for each span embedding \(\mathbf{h_{ij}}\).
The average of the span embeddings weighted by these values forms the output of this layer:
\(\varvec{\hat{h}}= \sum_{1 \le i < j \le K}\alpha(i, j)\mathbf{h_{ij}}.\) The final output of the Self-Explaining model is the output of the softmax classifier, and the loss for training the model is defined as: \(L=\log p(y|\textbf{x}) + \lambda \sum\limits_{i,j}\alpha^2(i, j).\)
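The span aggregation can be illustrated with a small dependency-free sketch. It is not the implementation of [87]: the span function F is passed in as `combine`, and the scoring of spans by a dot product with a vector `score_vector` followed by a softmax is an assumption made only so that the weights are positive and sum to one. Indices are 0-based in the code.

```python
import math


def span_embeddings(word_embs, combine):
    """Build h_ij = F(e_i, e_j) for every span with i < j."""
    K = len(word_embs)
    return {(i, j): combine(word_embs[i], word_embs[j])
            for i in range(K) for j in range(i + 1, K)}


def interpretation_layer(spans, score_vector):
    """Return span weights alpha(i, j) and the weighted average of span embeddings."""
    keys = list(spans)
    # Assumed scoring: dot product of each span embedding with score_vector.
    scores = [sum(w * x for w, x in zip(score_vector, spans[k])) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    norm = sum(exps)
    alpha = {k: e / norm for k, e in zip(keys, exps)}
    dim = len(score_vector)
    pooled = [sum(alpha[k] * spans[k][d] for k in keys) for d in range(dim)]
    return alpha, pooled


def alpha_penalty(alpha, lam):
    """Regularization term lambda * sum of squared span weights."""
    return lam * sum(a * a for a in alpha.values())


# Toy example: three words with 2-dimensional embeddings, F = element-wise mean.
mean = lambda a, b: [(x + y) / 2 for x, y in zip(a, b)]
spans = span_embeddings([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], mean)
alpha, pooled = interpretation_layer(spans, [0.5, -0.5])
print(alpha, pooled, alpha_penalty(alpha, lam=1.0))
```

The weights `alpha` also indicate which spans the model relied on, which is what makes the layer interpretable.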
To test our proposed architecture, we experiment with two types of contextual encoders: the Siamese Network and the Self-Explaining model. The dimensionality of the entity embeddings was set to 256, but the combined embeddings are linearly transformed to 128 dimensions in order to match the dimensionality of each encoder.
We implemented all the models using PyTorch [91] with HuggingFace’s transformers [92] on a machine equipped with a GPU.
5.2 Experiments with NLI pre-training
We first experiment with NLI datasets, given the similarity between NLI and our task as well as the availability of large-scale datasets for Natural Language Inference. It has been established that pre-training with NLI datasets improves accuracy in many downstream tasks [93]. We therefore first analyze whether pre-training selected models on NLI datasets and then fine-tuning them on the proposed task could help boost the accuracy of the results. For this experiment, we used the training sets of the SNLI 1.0 and MNLI 0.9 datasets, which include 550,152 and 392,702 instances, respectively. We also needed to establish a mapping between the NLI classes and the three classes of our task. We decided on the following correspondence:
-
Supported = entailment class of NLI,
-
Invalidated = contradiction class of NLI,
-
Unknown = neutral class of NLI.
We believe that this class mapping is the most suitable one based on the semantics of each class. Table 6 presents the results obtained with NLI pre-training. The results indicate that the NLI data has a certain relevance and usefulness for our task, as it improves the results for the Siamese network. However, the results obtained for the Self-Explaining model were not improved, likely because this model is already pre-trained. Interestingly, quite a large drop in accuracy occurs when using the MNLI dataset, although the actual reason for this remains unclear. Nevertheless, the results indicate that NLI datasets contain information that could be useful to some degree for our task, especially for relatively simple models like Siamese networks.
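The class correspondence listed above can be applied to NLI training instances with a simple lookup. In this sketch, the tuple layout of the examples and the function name are assumptions; instances whose label is not one of the three NLI classes (SNLI marks examples without a gold label with "-") are skipped.

```python
NLI_TO_TVR = {
    "entailment": "supported",
    "contradiction": "invalidated",
    "neutral": "unknown",
}


def relabel_nli(examples):
    """Map (premise, hypothesis, nli_label) triples onto the three task classes."""
    relabeled = []
    for premise, hypothesis, nli_label in examples:
        if nli_label not in NLI_TO_TVR:
            continue  # e.g. "-" for instances without a gold label
        relabeled.append((premise, hypothesis, NLI_TO_TVR[nli_label]))
    return relabeled
```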
5.3 Main results
We show the main results in Table 7. According to the results, the Self-Explaining model achieves the best accuracy. BERT provides the worst results, likely because of the unstable training of BERT, as also pointed out in [55]. The TVR task also seems challenging for large language models; the LLMs we use do not perform well on this task, even in the few-shot setting.
Another notable point is that incorporating commonsense knowledge from an external knowledge base improves the accuracy of both the Siamese and Self-Explaining models. However, the improvement is smallest for the Self-Explaining model with RoBERTa as the pre-trained module. RoBERTa has been trained on a larger amount of data than BERT and was subject to careful optimization.
Nevertheless, we observe that the improvement when incorporating our solution based on TransE is relatively good. The results rise from 0.715 to 0.784 (9.6% improvement) for the Siamese Net and from 0.805 to 0.819 (1.7% improvement) for the Self-Explaining model. In general, we conclude that adding commonsense data in the proposed way is a promising direction for the proposed task. We believe that exploring more sophisticated commonsense reasoning approaches and more comprehensive datasets could further improve the results. We later explore dataset extension to see if we could further boost the accuracy in simple and inexpensive ways.
Our conclusions are also supported by the confusion matrices for the Siamese model shown in Fig. 4. Incorporating TransE results in more correct determinations of the supported and unknown classes (improvements of 28% and 6.5%, respectively). On the other hand, performance on the invalidated class degrades only slightly (a decrease of 1.4%).
In Table 8, we also show examples of data instances misclassified by our approach.
Fig. 4 Confusion matrices for the TVR task of the Siamese network (left) and the Siamese network with TransE (right). The horizontal axis corresponds to the prediction (\(x^P\)) and the vertical one to the gold labels (\(x^G\)). The left (upper) blocks are invalidated, the middle ones are supported, and the right (bottom) ones are unknown.
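The relative improvements quoted above follow from the accuracy values by dividing the gain by the baseline accuracy. A short check, using only the numbers given in the text:

```python
def relative_change(before, after):
    """Relative change in percent with respect to the baseline value."""
    return (after - before) / before * 100.0


siamese = relative_change(0.715, 0.784)
self_explaining = relative_change(0.805, 0.819)
print(f"Siamese: {siamese:.2f}%  Self-Explaining: {self_explaining:.2f}%")
```

This gives roughly 9.65% for the Siamese network and 1.74% for the Self-Explaining model.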
5.4 Testing different knowledge types
We next investigate other ATOMIC-2020 relationships to determine if they are applicable to our task. ATOMIC-2020 incorporates three kinds of fundamental relations: physical-entity, social-interaction, and event-centered. We utilize all event-centered relationships, including isAfter and isBefore, which have already been used, as well as the newly added relations HasSubEvent, HinderedBy, Causes, xReason, and isFilledBy. Because both our objective and our dataset require reasoning about the relationships between two events, we regard this subset of the ATOMIC-2020 data as relevant and valuable for our task.
Table 9 shows the results of TransE trained with the aforementioned data using the Self-Explaining and Siamese models. Comparing with the previous results, we observe that incorporating additional knowledge into the Siamese Network enhances the accuracy on our task, improving the results by 1.8% (from 0.784 to 0.798). On the other hand, the Self-Explaining model’s accuracy decreases by 4.4% (from 0.819 to 0.783). This again suggests that the performance of simpler models that are not pre-trained can be enhanced by incorporating more data, while for more complex methods, additional relations about events are likely to confuse the model.
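Selecting the event-centered subset amounts to filtering knowledge-base triples by relation name. The sketch below uses the relation names listed above; the (head, relation, tail) tuple layout and the example triples are made up for illustration and are not taken from ATOMIC-2020.

```python
EVENT_CENTERED = {
    "isAfter", "isBefore", "HasSubEvent", "HinderedBy",
    "Causes", "xReason", "isFilledBy",
}


def select_triples(triples, relations=EVENT_CENTERED):
    """Keep only (head, relation, tail) triples whose relation is in the given set."""
    return [(h, r, t) for h, r, t in triples if r in relations]


# Illustrative triples (hypothetical content).
triples = [
    ("PersonX waits for the train", "isBefore", "PersonX boards the train"),
    ("PersonX waits for the train", "xWant", "to get home"),
    ("bread", "ObjectUse", "make a sandwich"),
]
print(select_triples(triples))
```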
5.5 Augmenting dataset
We also tried to automatically augment our dataset, as it contains a relatively small number of data instances. Data augmentation is a machine learning technique that adds new instances to the dataset at hand; these instances are typically generated automatically by synonym replacement, back-translation, text shortening or extension, or other approaches. We adopted back-translation in this study. In particular, we used facebook/wmt19-en-de and facebook/wmt19-de-en [94] as translation models, which means that the sentences were translated from English to German and then back from German to English. We then added the newly generated data to the training set, resulting in the incorporation of over 12k new sentence pairs.
Table 10 shows the results of the experiments with the augmented dataset. In total, we used 18k sentences for training (the original training data combined with the new training data). According to the table, the accuracy of the Siamese Network is slightly better than in the original results (0.715 in Table 7), whereas that of the Self-Explaining model is slightly worse than before (0.873 in Table 7). Overall, augmentation does not bring a large improvement in accuracy and sometimes even causes a drop. This is likely because the augmented data are still too similar to the original dataset, so the models cannot learn much from them. We believe that more manually annotated data could further help the performance, or that other data augmentation techniques could be experimented with. We will address this in our future work.
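The back-translation step can be outlined as follows. The two translators are passed in as plain callables that map a list of sentences to a list of translations; in practice they would wrap the English-to-German and German-to-English models named above. Translating both sentences of a pair, keeping the original label, and dropping pairs that come back unchanged are assumptions of this sketch.

```python
def back_translate(sentences, to_pivot, from_pivot):
    """Translate sentences into the pivot language and back again."""
    return from_pivot(to_pivot(sentences))


def augment_pairs(pairs, to_pivot, from_pivot):
    """Create new (premise, hypothesis, label) pairs via back-translation."""
    premises = back_translate([p for p, _, _ in pairs], to_pivot, from_pivot)
    hypotheses = back_translate([h for _, h, _ in pairs], to_pivot, from_pivot)
    seen = {(p, h) for p, h, _ in pairs}
    augmented = []
    for (_, _, label), premise, hypothesis in zip(pairs, premises, hypotheses):
        if (premise, hypothesis) in seen:
            continue  # identical to an existing pair, adds no new information
        seen.add((premise, hypothesis))
        augmented.append((premise, hypothesis, label))
    return augmented
```

Filtering out unchanged pairs is one cheap way to limit the near-duplicates that back-translation tends to produce.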
5.6 Combination methods of knowledge embedding
We next experiment with different methods of combining the signals obtained from processing the premise and hypothesis sentences. We consider simple concatenation as the baseline method for producing the combined embedding of hypothesis and premise. This keeps both the embedding models and the upper layers rather simple. Since the upper layers are relatively simple, we believe that it might be difficult for them to effectively process complex embeddings. Hence, we explore other combination variants, namely subtraction and the combination of concatenation, subtraction, and element-wise multiplication, in the knowledge-encoder-only setting.
Table 11 shows the results of our exploration of the different ways to produce the knowledge embedding when using the proposed model. The results show that as the amount of information increases, the performance improves. Hence, we also test the version with the Self-Explaining model equipped with concatenation, subtraction, and multiplication. Table 12 shows this result. Unlike for the proposed model, the accuracy of the combined concatenation, subtraction, and multiplication is, however, worse than that of concatenation. Thus, the more elaborate ways of combining the output of the Self-Explaining model with the knowledge embedding may be too complex for the upper layers of the Self-Explaining model to process.
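The three combination variants can be written compactly as below. This is an illustrative sketch on plain Python lists: the mode names, the direction of the subtraction (first minus second vector), and the ordering of the parts in the combined variant are assumptions.

```python
def combine(u, v, mode="concat"):
    """Combine two equally sized embeddings u and v into one feature vector."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimensionality")
    diff = [a - b for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    if mode == "concat":
        return list(u) + list(v)
    if mode == "sub":
        return diff
    if mode == "concat_sub_mul":
        return list(u) + list(v) + diff + prod
    raise ValueError(f"unknown mode: {mode}")


u, v = [1.0, 2.0], [3.0, 5.0]
print(combine(u, v, "concat"))
print(combine(u, v, "sub"))
print(combine(u, v, "concat_sub_mul"))
```

Note that the combined variant quadruples the input width of the upper layers compared with subtraction alone, which is one reason simple upper layers may struggle with it.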
5.7 Testing different knowledge embedding approaches
Finally, we explore different knowledge embedding models that could be used as substitutes for TransE. Table 13 shows the results of TransE variants combined with the Self-Explaining model. According to the table, the loss during pre-training does not go down for TransH [83] and ComplEx [84]. As the loss remains high, the accuracy on the proposed downstream task is lower, indicating that the proposed architecture requires simpler ways to construct the knowledge-based embeddings for TVR. TransH and ComplEx are more complex models than TransE, and the results show that they negatively impact the accuracy. Another possibility is that the models we tested were not supplied with sufficiently large knowledge to properly benefit from their more complex architectures.
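To make the difference in complexity concrete, the sketch below contrasts the standard TransE distance, which treats a relation as a translation between head and tail embeddings, with the TransH distance, which first projects both entities onto a relation-specific hyperplane. The formulas follow the usual textbook definitions (using the unsquared L2 norm); the function names and toy vectors are illustrative and not taken from our implementation.

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def transe_distance(h, r, t):
    """TransE: a plausible triple satisfies h + r ≈ t, i.e. a small distance."""
    return l2_norm([hi + ri - ti for hi, ri, ti in zip(h, r, t)])


def transh_distance(h, d_r, t, w_r):
    """TransH: project h and t onto the hyperplane with normal w_r, then translate by d_r."""
    norm = l2_norm(w_r)
    w = [x / norm for x in w_r]  # unit normal vector of the hyperplane

    def project(v):
        dot = sum(a * b for a, b in zip(v, w))
        return [a - dot * b for a, b in zip(v, w)]

    return transe_distance(project(h), d_r, project(t))


def margin_ranking_loss(d_pos, d_neg, margin=1.0):
    """Push true triples to be closer than corrupted ones by at least the margin."""
    return max(0.0, margin + d_pos - d_neg)


print(transe_distance([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]))
print(transh_distance([1.0, 5.0], [1.0, 0.0], [2.0, -7.0], [0.0, 1.0]))
```

TransH learns two vectors per relation (the normal and the translation) instead of one, which gives it more capacity but also more parameters to fit from the same amount of knowledge.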
6 Applications
This section discusses some applications of the proposed task and the proposed model.
Support for story understanding and event extraction: Effective methods trained for the proposed task can lead to better comprehension of textual narratives and, potentially, more accurate event extraction [95]. Given the evidence provided in the following sentences, a component that can reason about action completion would, for example, enhance the reading comprehension of stories. We observe that such knowledge is frequently stated in an implicit way.
Recommendation of microblog posts: Microblog entries can remain valid for different lengths of time. In the era of information overload, users generally wish to select and read only valid messages from a large number of postings, as such messages are typically the most relevant and significant. Information overload could be reduced if users had the option to filter out invalid posts from their timelines, or if posts were ranked based on a number of factors, including their anticipated temporal validity.
User tracking and analysis: User tracking and analysis in social networking services (SNSs) [8, 96, 97] can be supported by temporal processing of a user’s posts so that the user’s current situation and actions are known at each time point. Temporally targeted ads then become possible. Additionally, in emergency situations it would be easier to confirm users’ whereabouts and safety.
Chatbots: Finally, chatbots and AI assistants [98] could be equipped with means for better understanding user contexts, plans, required actions, and stories told by users in order to maintain better communication with humans.
7 Conclusion and future work
There are still numerous challenges in computational processing of implicit temporal information in natural language. In this work, we propose a novel task for reasoning about the temporal validity of sentences based on a follow-up additional context, and also design as well as train a new model with an embedded knowledge base for this task. Inspired by the observation that humans can evaluate validity of text based on their temporal commonsense, we aim at equipping machines with the same ability.
We formally define the task, and construct a dedicated dataset for it, along with selecting a suitable dataset and proposing a new method for embedding sentences within knowledge bases using machine learning.
We believe that our work can contribute to a better understanding of how to rely on knowledge bases containing sentences as entities and can enhance the precision of the TVR task or other types of temporal commonsense reasoning. We have also conducted experiments with NLI data as well as with a data augmentation technique to determine if they can be utilized for the proposed task.
In future work, we plan to extend the dataset both through more extensive manual annotation and by applying more effective data augmentation approaches. The latter can be achieved by using Large Language Models to paraphrase the existing sentences as well as by prompting the models to create more instances of similar sentences. Naturally, such data augmentation needs to be carried out together with strict manual checks, the same as in the case of crowdsourced data creation. The new dataset should not only take the form of sentence pairs but should also contain larger chunks of text, such as entire paragraphs. Another future direction is to extend the task itself. The timestamps of the premise and hypothesis sentences could be used as an additional signal to identify the validity of a hypothesis sentence [6], in addition to judgments based on the content of premise sentences. This would lead to a more general task and more potential applications.
Data availability
The dataset is available at: https://tinyurl.com/T-NLI
Notes
Sect. 2.5 provides more details on similarities and differences with NLI.
Note that it is not always easy to determine the correct answer, as the context or necessary details might be missing; in such cases, machines, like humans, should rely on probabilistic reasoning.
Examples taken from [7].
SNLI dataset is licensed under CC-BY-SA 4.0.
References
Torfi A, Shirvani RA, Keneshloo Y, Tavaf N, Fox EA. Natural language processing advancements by deep learning: a survey. arXiv preprint. 2020. arXiv:2003.01200.
Storks S, Gao Q, Chai JY. Commonsense reasoning for natural language understanding: a survey of benchmarks, resources, and approaches. arXiv preprint. 2019;1–60. arXiv:1904.01172.
Storks S, Gao Q, Chai JY. Recent advances in natural language inference: A survey of benchmarks, resources, and approaches. arXiv preprint. 2019. arXiv:1904.01172.
Hosokawa T, Jatowt A, Sugiyama K. Temporal natural language inference: evidence-based evaluation of temporal text validity. In: Kamps J, Goeuriot L, Crestani F, Maistro M, Joho H, Davis B, Gurrin C, Kruschwitz U, Caputo A, editors. Advances in information retrieval—45th European conference on information retrieval, vol. Proceedings, Part I. Lecture Notes in Computer Science, volume 13980. Dublin: ECIR; 2023. p. 441–58. https://doi.org/10.1007/978-3-031-28244-7_28.
Campos R, Dias G, Jorge AM, Jatowt A. Survey of temporal information retrieval and related applications. ACM Comput Surv (CSUR). 2014;47(2):1–41.
Almquist A, Jatowt A. Towards Content Expiry Date Determination: Predicting Validity Periods of Sentences. In: Proceedings of the 41st European conference on IR research (ECIR ’19) 2019. p. 86–101.
Hwang JD, Bhagavatula C, Le Bras R, Da J, Sakaguchi K, Bosselut A, Choi Y. (Comet-) atomic 2020: on symbolic and neural commonsense knowledge graphs. In: Proceedings of the 34th AAAI conference on artificial intelligence (AAAI-21). 2021. p. 6384–6392.
Abe S, Shirakawa M, Nakamura T, Hara T, Ikeda K, Hoashi K. Predicting the Occurrence of Life Events from User’s Tweet History. In: Proceedings of the 12th IEEE International conference on semantic computing (ICSC ’18). 2018. p. 219–226.
Kanazawa K, Jatowt A, Tanaka K. Improving retrieval of future-related information in text collections. In: Proceedings of the 2011 IEEE/WIC/ACM International conference on web intelligence (WI ’11). 2011. p. 278–283.
Minard A-L, Speranza M, Agirre E, Aldabe I, Erp M, Magnini B, Rigau G, Urizar R. SemEval-2015 task 4: timeline: cross-document event ordering. In: Proceedings of the 9th International workshop on semantic evaluation (SemEval 2015). Denver: Association for Computational Linguistics; 2015. p. 778–786. https://doi.org/10.18653/v1/S15-2132.
Cheng F, Miyao Y. Predicting event time by classifying sub-level temporal relations induced from a unified representation of time anchors. arXiv preprint. 2020. arXiv:2008.06452.
Jatowt A, Antoine É, Kawai Y, Akiyama T. Mapping temporal horizons: analysis of collective future and past related attention in Twitter. In: Proceedings of the 24th International conference on world wide web (WWW ’15), 2015. p. 484–494.
Ning Q, Wu H, Roth D. A multi-axis annotation scheme for event temporal relations. In: Proceedings of the 56th Annual meeting of the association for computational linguistics, volume 1: long papers. Melbourne: Association for Computational Linguistics; 2018. p. 1318–1328. https://doi.org/10.18653/v1/P18-1122.
Yamamoto Y, Tezuka T, Jatowt A, Tanaka K. Supporting judgment of fact trustworthiness considering temporal and sentimental aspects. In: Web Information Systems Engineering-WISE 2008: 9th International Conference, Proceedings 9. Auckland: Springer; 2008. p. 206–220.
Kawai H, Jatowt A, Tanaka K, Kunieda K, Yamada K. Chronoseeker: search engine for future and past events. In: Proceedings of the 4th International Conference on Ubiquitous Information Management and Communication. 2010. p. 1–10.
Allein L, Augenstein I, Moens M-F. Time-aware evidence ranking for fact-checking. J Web Semant. 2021;71: 100663.
Han R, Liang M, Alhafni B, Peng N. Contextualized Word Embeddings Enhanced Event Temporal Relation Extraction for Story Understanding. arXiv preprint. 2019. arXiv:1904.11942.
Santana BS, Campos R, Amorim E, Jorge A, Silvano P, Nunes S. A survey on narrative extraction from textual data. Artif Intell Rev. 2023;56(8):8393–435. https://doi.org/10.1007/S10462-022-10338-7.
Vashishtha S, Van Durme B, White AS. Fine-grained temporal relation extraction. In: Proceedings of the 57th Annual meeting of the association for computational linguistics. Florence: Association for Computational Linguistics; 2019. p. 2906–2919. https://doi.org/10.18653/v1/P19-1280.
Dligach D, Miller T, Lin C, Bethard S, Savova G. Neural temporal relation extraction. In: Proceedings of the 15th Conference of the European chapter of the association for computational linguistics: volume 2, short papers. Valencia: Association for Computational Linguistics; 2017. p. 746–751. https://aclanthology.org/E17-2118.
Harabagiu S, Bejan CA. Question answering based on temporal inference. In: Proceedings of the AAAI-2005 workshop on inference for textual question answering. 2005. p. 27–34.
Jatowt A. Temporal question answering in news article collections. In: Companion of The web conference 2022. Lyon: Virtual Event. 2022. p. 895–895.
Kanhabua N, Anand A. Temporal information retrieval. In: Proceedings of the 39th International ACM SIGIR conference on research and development in information retrieval. 2016. p. 1235–1238.
Trinh TH, Le QV. A simple method for commonsense reasoning. arXiv preprint. 2018. arXiv:1806.02847.
Levesque H, Davis E, Morgenstern L. The winograd schema challenge. In: Proceedings of the 13th International conference on the principles of knowledge representation and reasoning (KR ’12). 2012. p. 552–561.
Rashkin H, Sap M, Allaway E, Smith NA, Choi Y. Event2Mind: commonsense inference on events, intents, and reactions. In: Proceedings of the 56th annual meeting of the association for computational linguistics, volume 1: long papers. Melbourne: Association for Computational Linguistics; 2018. p. 463–473. https://doi.org/10.18653/v1/P18-1043.
Luo Z, Sha Y, Zhu KQ, Hwang SW, Wang Z. Commonsense causal reasoning between short texts. In: Proceedings of the 15th International conference on the principles of knowledge representation and reasoning (KR ’16). 2016. p. 421–431.
Gao Q, Yang S, Chai J, Vanderwende L. What action causes this? Towards naive physical action-effect prediction. In: Proceedings of the 56th annual meeting of the association for computational linguistics, volume 1: long papers. Melbourne: Association for Computational Linguistics; 2018. p. 934–45. https://doi.org/10.18653/v1/P18-1086.
Tamborrino A, Pellicanò N, Pannier B, Voitot P, Naudin L. Pre-training is (almost) all you need: An application to commonsense reasoning. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. 2020. p. 3878–87. https://doi.org/10.18653/v1/2020.acl-main.357.
Liu Q, Jiang H, Ling ZH, Zhu X, Wei S, Hu Y. Combing Context and Commonsense Knowledge Through Neural Networks for Solving Winograd Schema Problems. In: Proceedings of the AAAI 2017 spring symposium on computational context: why it’s important, what it means, and can it be computed?. 2017. p. 315–321.
Mostafazadeh N, Chambers N, He X, Parikh D, Batra D, Vanderwende L, Kohli P, Allen J. A corpus and cloze evaluation for deeper understanding of commonsense stories. In: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies. California: Association for Computational Linguistics. 2016. p. 839–49. https://doi.org/10.18653/v1/N16-1098.
Lin BY, Chen X, Chen J, Ren X. KagNet: Knowledge-aware graph networks for commonsense reasoning. In: Proceedings of the 2019 Conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Hong Kong: Association for Computational Linguistics. 2019. p. 2829–39. https://doi.org/10.18653/v1/D19-1282.
Zhou B, Ning Q, Khashabi D, Roth D. Temporal common sense acquisition with minimal supervision. In: Jurafsky D, Chai J, Schluter N, Tetreault J, editors. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. 2020. p. 7579–7589. https://doi.org/10.18653/v1/2020.acl-main.678.
Zhou B, Khashabi D, Ning Q, Roth D. “Going on a vacation” takes longer than “going for a walk”: a study of temporal commonsense understanding. In: Proceedings of the 2019 Conference on empirical methods in natural language processing and the 9th International joint conference on natural language processing (EMNLP-IJCNLP). Hong Kong: Association for Computational Linguistics; 2019. p. 3363–9. https://doi.org/10.18653/v1/D19-1332.
White RW, Hassan Awadallah A. Task duration estimation. In: Proceedings of the 12th ACM International conference on web search and data mining (WSDM ’19). 2019. p. 636–44.
Takemura H, Tajima K. Tweet classification based on their lifetime duration. In: Proceedings of the 21st ACM International conference on information and knowledge management (CIKM ’12). 2012. p. 2367–70.
Wenzel G, Jatowt A. Temporal validity change prediction. CoRR. 2024. https://doi.org/10.48550/ARXIV.2401.00779.
Qin L, Gupta A, Upadhyay S, He L, Choi Y, Faruqui M. TIMEDIAL: temporal commonsense reasoning in dialog. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing, volume 1: long papers. Association for Computational Linguistics. 2021. p. 7066–7076. https://doi.org/10.18653/v1/2021.acl-long.549.
Wenzel G, Jatowt A. An overview of temporal commonsense reasoning and acquisition. CoRR. 2023. https://doi.org/10.48550/ARXIV.2308.00002.
Jain R, Sojitra D, Acharya A, Saha S, Jatowt A, Dandapat S. Do language models have a common sense regarding time? Revisiting temporal commonsense reasoning in the era of large language models. In: Bouamor H, Pino J, Bali K, editors. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: EMNLP; 2023. p. 6750–74. https://aclanthology.org/2023.emnlp-main.418.
Wang J, Jatowt A, Yoshikawa M, Cai Y. BiTimeBERT: extending pre-trained language representations with bi-temporal information. 2023.
Cole JR, Chaudhary A, Dhingra B, Talukdar P. Salient span masking for temporal understanding. 2023.
Kimura M, Kanashiro Pereira L, Kobayashi I. Towards a language model for temporal commonsense reasoning. In: Djabri S, Gimadi D, Mihaylova T, Nikolova-Koleva I, editors. Proceedings of the student research workshop associated with RANLP. INCOMA Ltd. 2021. p. 78–84. https://aclanthology.org/2021.ranlp-srw.12.
Zhou B, Ning Q, Khashabi D, Roth D. Temporal common sense acquisition with minimal supervision. In: Proceedings of the 58th annual meeting of the association for computational linguistics (ACL ’20). 2020. p. 7579–89.
Dhingra B, Cole JR, Eisenschlos JM, Gillick D, Eisenstein J, Cohen WW. Time-aware language models as temporal knowledge bases. Transact Assoc Comput Linguist. 2022;10:257–73. https://doi.org/10.1162/tacl_a_00459.
Jang J, Ye S, Lee C, Yang S, Shin J, Han J, Kim G, Seo M. TemporalWiki: a lifelong benchmark for training and evaluating ever-evolving language models. 2023.
Dagan I, Glickman O, Magnini B. The PASCAL recognising textual entailment challenge. In: Machine learning challenges workshop (MLCW ’05). 2005. p. 177–190.
Bowman SR, Angeli G, Potts C, Manning CD. A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 conference on empirical methods in natural language processing. Lisbon: Association for Computational Linguistics. 2015. p. 632–42. https://doi.org/10.18653/v1/D15-1075.
Demszky D, Guu K, Liang P. Transforming question answering datasets into natural language inference datasets. arXiv preprint. arXiv:1809.02922. 2018.
Glockner M, Shwartz V, Goldberg Y. Breaking NLI systems with sentences that require simple lexical inferences. In: Proceedings of the 56th annual meeting of the association for computational linguistics, volume 2: short papers. Melbourne: Association for Computational Linguistics. 2018. p. 650–655. https://doi.org/10.18653/v1/P18-2103.
Williams A, Nangia N, Bowman S. A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1, long papers. New Orleans: Association for Computational Linguistics. 2018. p. 1112–22. https://doi.org/10.18653/v1/N18-1101.
Khot T, Sabharwal A, Clark P. SciTaiL: a textual entailment dataset from science question answering. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI-18). 2018.
Vashishtha S, Poliak A, Lal YK, Van Durme B, White AS. Temporal reasoning in natural language inference. In: Findings of the association for computational linguistics: EMNLP 2020. Association for Computational Linguistics. 2020. p. 4070–4078. https://doi.org/10.18653/v1/2020.findings-emnlp.363.
Chen Q, Zhu X, Ling ZH, Wei S, Jiang H, Inkpen D. Enhanced LSTM for natural language inference. In: Proceedings of the 55th Annual meeting of the association for computational linguistics, volume 1: long papers. Vancouver: Association for Computational Linguistics. 2017. p. 1657–1668. https://doi.org/10.18653/v1/P17-1152.
Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1, long and short papers. Minneapolis: Association for Computational Linguistics. 2019. p. 4171–4186. https://doi.org/10.18653/v1/N19-1423.
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. Roberta: a robustly optimized bert pretraining approach. arXiv preprint. 2019. arXiv:1907.11692.
Crawshaw M. Multi-task learning with deep neural networks: a survey. arXiv preprint. 2020. arXiv:2009.09796.
Clark P, Dalvi B, Tandon N. What Happened? Leveraging VerbNet to predict the effects of actions in procedural text. arXiv preprint. 2018. arXiv:1804.05435.
Mihaylov T, Frank A. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In: Proceedings of the 56th annual meeting of the association for computational linguistics, volume 1: long papers. Melbourne: Association for Computational Linguistics; 2018. p. 821–832. https://doi.org/10.18653/v1/P18-1076.
Yasunaga M, Ren H, Bosselut A, Liang P, Leskovec J. QA-GNN: Reasoning with language models and knowledge graphs for question answering. In: Proceedings of the 2021 Conference of the North American chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics. 2021. p. 535–546. https://doi.org/10.18653/v1/2021.naacl-main.45.
Liu H, Singh P. ConceptNet—a practical commonsense reasoning tool-kit. BT Technol J. 2004;22(4):211–26.
Speer R, Chin J, Havasi C. ConceptNet 5.5: an open multilingual graph of general knowledge. In: Proceedings of the 31st AAAI conference on artificial intelligence (AAAI-17). 2017. p. 4444–4451.
Vrandečić D, Krötzsch M. Wikidata: a free collaborative knowledgebase. Commun ACM. 2014;57(10):78–85.
Peters ME, Neumann M, Logan R, Schwartz R, Joshi V, Singh S, Smith NA. Knowledge enhanced contextual word representations. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Hong Kong: Association for Computational Linguistics. 2019. p. 43–54. https://doi.org/10.18653/v1/D19-1005.
Zhang Z, Han X, Liu Z, Jiang X, Sun M, Liu Q. ERNIE: Enhanced language representation with informative entities. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Florence: Association for Computational Linguistics. 2019. p. 1441–1451. https://doi.org/10.18653/v1/P19-1139.
Zhang T, Cai Z, Wang C, Li P, Li Y, Qiu M, Tang C, He X, Huang J. HORNET: enriching pre-trained language representations with heterogeneous knowledge sources. In: Proceedings of the 30th ACM international conference on information and knowledge management (CIKM ’21). 2021. p. 2608–17.
Chen Y, Huang S, Wang F, Cao J, Sun W, Wan X. Neural maximum subgraph parsing for cross-domain semantic dependency analysis. In: Proceedings of the 22nd conference on computational natural language learning. Brussels: Association for computational linguistics; 2018. p. 562–72. https://doi.org/10.18653/v1/K18-1054.
Kapanipathi P, Thost V, Patel SS, Whitehead S, Abdelaziz I, Balakrishnan A, Chang M, Fadnis K, Gunasekara C, Makni B. Infusing knowledge into the textual entailment task using graph convolutional networks. In: Proceedings of the 34th AAAI conference on artificial intelligence (AAAI-20). 2020. p. 8074–81.
Wang X, Kapanipathi P, Musa R, Yu M, Talamadupula K, Abdelaziz I, Chang M, Fokoue A, Makni B, Mattei N, Talamadupula K, Fokoue A. Improving natural language inference using external knowledge in the science questions domain. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI-19). 2019. p. 7208–15.
Zhang L, Lyu Q, Callison-Burch C. Reasoning about goals, steps, and temporal ordering with WikiHow. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics. 2020. p. 4630–4639. https://doi.org/10.18653/v1/2020.emnlp-main.374.
Srivastava A, Rastogi A, Rao A. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. 2023.
Chen W, Wang X, Wang WY. A dataset for answering time-sensitive questions. 2021.
Fyodorov Y, Winter Y, Francez N. A natural logic inference system. In: Proceedings of the 2nd workshop on inference in computational semantics (ICoS-2). 2000.
Condoravdi C, Crouch D, Paiva V, Stolle R, Bobrow DG. Entailment, intensionality and text understanding. In: Proceedings of the HLT-NAACL 2003 workshop on text meaning. 2003. p. 38–45.
Fillmore CJ, Baker C. A frames approach to semantic analysis. In: The Oxford handbook of linguistic analysis. 2010.
Koupaee M, Wang WY. Wikihow: A large scale text summarization dataset. arXiv preprint. 2018. arXiv:1810.09305.
Miech A, Zhukov D, Alayrac J-B, Tapaswi M, Laptev I, Sivic J. HowTo100M: learning a text-video embedding by watching hundred million narrated video clips. In: Proceedings of the IEEE/CVF International conference on computer vision (ICCV ’19). 2019. p. 2630–2640.
Schuler K. Verbnet: a broad-coverage, comprehensive verb lexicon. PhD thesis, University of Pennsylvania. 2005.
Sap M, Le Bras R, Allaway E, Bhagavatula C, Lourie N, Rashkin H, Roof B, Smith NA, Choi Y. ATOMIC: an atlas of machine commonsense for if-then reasoning. In: Proceedings of the AAAI conference on artificial intelligence (AAAI-19). 2019. p. 3027–35.
Bordes A, Usunier N, Garcia-Durán A, Weston J, Yakhnenko O. Translating embeddings for modeling multi-relational data. In: Proceedings of the 27th International conference on neural information processing systems (NIPS ’13), 2013. p. 2787–95.
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Proceedings of the 27th annual conference on neural information processing systems (NIPS ’13), 2013. p. 3111–9.
Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Hong Kong: Association for Computational Linguistics. 2019. p. 3982–92 https://doi.org/10.18653/v1/D19-1410.
Wang Z, Zhang J, Feng J, Chen Z. Knowledge graph embedding by translating on hyperplanes. In: Proceedings of the 28th AAAI conference on artificial intelligence (AAAI-14). 2014. p. 1112–9.
Trouillon T, Welbl J, Riedel S, Gaussier É, Bouchard G. Complex embeddings for simple link prediction. In: Proceedings of the 33nd international conference on machine learning (ICML ’16). 2016. p. 2071–080.
Young P, Lai A, Hodosh M, Hockenmaier J. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transact Assoc Comput Linguist. 2014;2:67–78. https://doi.org/10.1162/tacl_a_00166.
Chicco D. Siamese neural networks: an overview. Artificial Neural Networks. 3rd edition. 2021. p. 73–94.
Sun Z, Fan C, Han, Q, Sun X, Meng Y, Wu F, Li J. Self-explaining structures improve NLP models. arXiv preprint. 2020. arXiv:2012.01786.
Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, Lacroix T, Rozière B, Goyal N, Hambro E, Azhar F, et al. Llama: open and efficient foundation language models. arXiv preprint. 2023. arXiv:2302.13971.
Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Doha: Association for Computational Linguistics. 2014. p. 1532–43. https://doi.org/10.3115/v1/D14-1162.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Advances in neural information processing systems. 2017. p. 5998–6008.
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S. PyTorch: an imperative style, high-performance deep learning library. In: Proceedings of the 33rd conference on neural information processing systems (NeurIPS ’19). 2019. p. 8026–8037.
Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, et al. HuggingFace’s transformers: state-of-the-art natural language processing. arXiv preprint. 2019.arXiv:1910.03771.
Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A. Supervised learning of universal sentence representations from natural language inference data. In: Proceedings of the 2017 conference on empirical methods in natural language processing. Copenhagen: Association for Computational Linguistics. 2017. p. 670–80. https://doi.org/10.18653/v1/D17-1070.
Ng N, Yee K, Baevski A, Ott M, Auli M, Edunov S. Facebook FAIR’s WMT19 news translation task submission. In: Proceedings of the fourth conference on machine translation, volume 2: shared task papers, day 1. Florence: Association for Computational Linguistics. 2019. p. 314–19. https://doi.org/10.18653/v1/W19-5333.
Xiang W, Wang B. A survey of event extraction from text. IEEE Access. 2019;7:173111–37. https://doi.org/10.1109/ACCESS.2019.2956831.
Abel F, Gao Q, Houben G-J, Tao K. Analyzing user modeling on twitter for personalized news recommendations. In: Proceedings of the 19th International conference on user modeling, adaptation, and personalization (UMAP ’11). 2011. p. 1–12.
Li P, Lu H, Kanhabua N, Zhao S, Pan G. Location inference for non-geotagged tweets in user timelines. IEEE Transact Knowl Data Eng (TKDE). 2018;31(6):1150–65.
Mnasri M. Recent advances in conversational NLP: Towards the standardization of Chatbot building. arXiv preprint. 2019. arXiv:1903.09025.
Funding
Open access funding provided by University of Innsbruck and Medical University of Innsbruck.
Author information
Contributions
Taishi Hosokawa designed and implemented the approach, ran the experiments, and wrote the paper. Adam Jatowt proposed the research objective and the concept of temporal validity reassessment. Kazunari Sugiyama and Adam Jatowt edited and extended the initial manuscript, took part in the discussions, and supervised the research.
Ethics declarations
Ethics approval and consent to participate
We follow the ACM Code of Ethics and Professional Conduct. User data obtained through crowdsourcing was handled carefully, and no personal information about the users who took part in creating the dataset is released.
Competing interests
There are no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Hosokawa, T., Jatowt, A. & Sugiyama, K. Temporal validity reassessment: commonsense reasoning about information obsoleteness. Discov Computing 27, 4 (2024). https://doi.org/10.1007/s10791-024-09433-w
DOI: https://doi.org/10.1007/s10791-024-09433-w