1 Introduction

It is rather easy for humans to reason about the validity of sentences. Consider a sentence written by a hypothetical user, for example as a microblog post: “Now waiting for a train to Munich, should be arriving here soon.” If the user then issues a subsequent post, “Again delay?, what’s wrong with DB trains these days?!”, we can guess that the person is very likely still waiting. That is, the action (i.e., waiting) stated in the former message is still ongoing, and thus the first sentence remains valid. On the other hand, if the subsequent post were “Finally! So, Goodbye Berlin!”, it would be highly likely that the first message (the one about waiting) is no longer valid in view of this additional evidence. Taking another example, a person sees the title of a news article: “Polish PM visits White House”, while two days later another news title catches her eye: “Polish Prime Minister Donald Tusk speaks during a press conference after a government meeting in Warsaw”. The latter information implicitly indicates the completion of the PM’s visit to the USA, i.e., the former sentence is no longer valid; however, this would not be the case if the press conference had taken place, for instance, in New York. Humans can perform this kind of inference quite well thanks to common sense and world knowledge, while for machines such sentences may still pose challenges.

Computers have demonstrated significant advances in natural language comprehension as a result of the emergence of pre-trained models [1]. However, it remains difficult for machines to perform effective reasoning that requires common sense [2]. As the amount of available text information has exploded in the modern digital era, it has become increasingly crucial for machines to acquire a much deeper understanding of natural language.

We investigate a novel temporal commonsense reasoning task called Temporal Validity Reassessment (TVR), which involves reasoning about stationarity and action continuation or abortion/completion based on additional related content treated as evidence. As in the previous example, the temporal validity of the sentence “Now waiting for a train to Munich, should be arriving here soon.” needs to be judged in view of supplementary content, which in social media scenarios is assumed to be published after the first sentence (see Fig. 1). Note that this task involves a pair of sentences where one is a hypothesis and the other is a premise, hence there is a natural resemblance to the Natural Language Inference (NLI) task (Footnote 1) [3], and so TVR could actually also be called Temporal Natural Language Inference, as in our prior publication [4] on which the current paper is based. While the concept of temporal validity estimation is general, that is, content of any size can be considered as input, for the sake of simplicity we take pairs of sentences as input, which is also in line with the input to the NLI task.

TVR could be beneficial for diverse Natural Language Processing (NLP) tasks as well as for improving Information Retrieval (IR), since better estimation of the obsoleteness and timeliness of search results would lead to more effective temporal information retrieval models [5]. Sec. 6 expands the discussion on the applications of approaches for reasoning about the temporal validity of text.

Similar to NLI, we address the TVR problem as a classification task over the two input sentences: hypothesis and premise. Note that the order is opposite to the NLI task, where usually the premise is given as the first sentence followed by the hypothesis, and the relation between both is to be judged. In our task, the sentence order is particularly important since it implies chronology. The first is the hypothesis sentence, whose validity is to be assessed, followed by the premise sentence, which aims to provide new information useful for reasoning about the temporal state of the hypothesis. We set up the following classification labels for the hypothesis sentence: Supported: the hypothesis is still valid in view of the premise; Invalidated: the hypothesis ceased to be valid based on the premise content; and Unknown: the premise does not provide any clue about the validity of the hypothesis sentence. Note that, in traditional settings of NLI, the semantics of the labels are such that they evaluate whether the premise entails or contradicts the hypothesis. In TVR, on the other hand, the labels express whether the premise supports (action still ongoing) or invalidates (action aborted or completed) the hypothesis, or whether the premise does not provide useful information about the temporal state of the hypothesis. As mentioned before, while there are significant similarities between TVR and NLI in the input format and in the number and role of classes, the temporal and continuity-related character of the classes in TVR makes this task relatively distinct. In the later part of the paper, we explore the possibility of using NLI datasets for pre-training machine learning models for the TVR task.

Considering our earlier example, we regard “Now waiting for a train to Munich, should be arriving here soon.” as a hypothesis since its validity is unknown, and “Again delay?, what’s wrong with DB trains these days?!” or “Finally! So, Goodbye Berlin!” as a premise (Footnote 2). Almquist and Jatowt [6] focused on another variant of the temporal validity estimation problem (also called information expiry date estimation) in which a single sentence (hypothesis only) formed the input and the task was to decide the minimum length of the time period (usually represented as a few validity duration classes) during which the hypothesis should remain valid. For example, for the sentence “Now waiting for a train to Munich, should be arriving here soon.” we would assume a validity of a few minutes or a few hours rather than a few days or weeks. Note that the two tasks can be combined so that both the elapsed time and the additional context are used for judging whether the target sentence remains valid.

1.1 Contributions

Besides the introduction of a novel task, we also describe a knowledge-enhanced method and a dedicated dataset for the proposed task. Our approach incorporates information from a large knowledge graph (ATOMIC-2020 [7]) that holds commonsense knowledge, since we believe that an effective approach requires large-scale and deep knowledge about the world.

The proposed model combines two encoders: the first represents commonsense knowledge from ATOMIC-2020 via pre-training, and the second encodes the commonsense knowledge present in the training dataset we use. Both encoders jointly reason over the commonsense knowledge in each input pair of sentences. The dataset that we construct and provide contains over 10k pairs of sentences describing concrete actions, with labels denoting the temporal validity of the hypotheses. We have released both the dataset and the associated code for our approach (Footnote 3).

The remainder of this paper is organized as follows. In Section 2, we survey the related work, while we formally describe the task and introduce our approach to solving it in Section 3. We explain the way in which we constructed the dataset in Section 4, and we describe the experimental settings as well as discuss the experimental results in Section 5. Section 6 provides an overview of the applications of our proposed task, while the last section concludes the paper.

2 Related work

We discuss the related work starting with a background on temporal commonsense reasoning and natural language inference. We then compare our task with related ones, and also overview similar datasets and related tasks.

2.1 Temporal commonsense reasoning

Several tasks in the Natural Language Processing and Information Retrieval domains consider time as a key aspect of text [8,9,10,11,12,13,14,15,16], including temporal understanding of stories [17, 18], temporal relation extraction [19, 20], temporal question answering [21, 22], and so on. Many of these works make use of temporal expressions embedded in text or of document timestamps (i.e., publication dates) [5, 23].

Fig. 1 A schematic comparison of the TVR task (right) with Temporal Validity Duration Prediction (TVDP) introduced in [6] (left) on the case of microblog posts. Post A indicates a hypothesis sentence while post B is a premise

Implicit information that humans commonly know is addressed in what is called the Commonsense Reasoning domain [24]. The Winograd Schema Challenge [25] was one of the earliest challenges for machines in this regard, and many other challenges and approaches have since been proposed [26,27,28,29,30,31,32]. Temporal Commonsense is one of them, in which temporal challenges are addressed [33]. Zhou et al. [34] focused on comparing actions such as “going on a vacation” with others like “going for a walk” to assess which takes longer, and constructed a question-answering dataset including this kind of estimation. In particular, the subset of that dataset that relates to stationarity is relevant to our work. We further compare our task with other related ones in Sect. 2.5.

White and Awadallah [35] estimated the duration of tasks assigned by users in calendars. Takemura and Tajima [36] classified microblog posts into different lifetime classes based on features specific to Twitter such as the number of followers or the presence of URLs. Almquist and Jatowt [6] examined the validity of sentences considering the time elapsed since their creation (more in Sect. 2.5). They introduced a novel task of estimating the validity period of a sentence and constructed a dedicated dataset composed of sentences from Wikipedia, news and social media. The most recent extension of the temporal validity research involves determining the direction of change in temporal validity (increase, decrease or neutral) [37].

To probe models’ common sense with regard to temporal relations and reasoning, the TimeDial [38] and MC-TACO [34] datasets were established, embracing a diverse array of situations and types of temporal information. We overview common datasets related to temporal commonsense reasoning in Section 2.5. More information about datasets is also provided in the recent survey [39]. Jain et al. [40] have analyzed the performance of diverse large language models (LLMs) across different temporal commonsense reasoning tasks, finding that such reasoning still poses significant challenges for LLMs.

Additionally, various developments in time-aware training and representation strategies for language models have been proposed recently [41,42,43,44], including temporal reasoning approaches based on knowledge graphs [45, 46]. Overall, contemporary research has exhibited a notable expansion of temporal reasoning studies in natural language understanding.

2.2 Natural Language Inference

Recently, Natural Language Understanding (NLU) by computers has attracted a lot of researchers’ attention. Natural Language Inference (NLI), or Recognizing Textual Entailment, is one of the NLU domains in which computers deal with input in the form of two sentences [3], similar to our proposed task. NLI problems require determining whether a premise sentence entails, contradicts, or is neutral to a hypothesis sentence (or, in some settings, entails vs. does not entail). In the early stages of NLI research, Dagan et al. [47] constructed a relatively small dataset. The first large annotated dataset was the Stanford Natural Language Inference (SNLI) dataset [48], which was annotated through crowdsourcing. After that, many NLI datasets [49, 50], including Multi-genre Natural Language Inference (MNLI) [51] and SciTail [52], have been constructed. Vashishtha et al. [53] also converted existing datasets for temporal reasoning into NLI format, pointing out that there was no NLI dataset dedicated to temporal reasoning. Their task focuses on explicit temporal descriptions while our task tackles implicit information. The emergence of many large datasets made it possible to train more complex models [3, 54]. Remarkably, pre-trained models such as BERT [55] and RoBERTa [56] demonstrated strong performance on NLI datasets, and were also used to train multi-task models [57].

2.3 Incorporation of knowledge bases

Generally, NLU works make use of Knowledge Graphs (KGs) or Knowledge Bases (KBs) to improve model performance [58,59,60]. In particular, commonsense reasoning works commonly incorporate knowledge from large KBs such as ConceptNet [61, 62] and Wikidata [63] into their architectures [64,65,66]. However, only a few works in NLI attempt to incorporate KGs into computational models [67, 68]. Wang et al. [69], for example, improve performance on SciTail using knowledge borrowed from ConceptNet.

2.4 Comparison with related tasks

Similar to NLI, our work addresses a text classification problem in which two sentences form an input. However, we focus neither on entailment nor on contradiction but on the validity of sentences (see Tables 1 and 2 for a comparison).

Table 1 Example sentences and TVR labels from our dataset
Table 2 Example sentences of NLI task from SNLI dataset (borrowed from SNLI website: https://nlp.stanford.edu/projects/snli/)

The NLI dataset constructed by Vashishtha et al. [53] includes temporal phenomena. However, their task addresses explicit descriptions of temporal relations such as duration and order, while we focus on implicit temporal information that is latent in sentences, similar to [34]. The latter task deals with reasoning about event duration, ordering, and frequency in a separate manner, whereas our task requires a more comprehensive understanding of temporal phenomena through a contrastive type of inference. Also, their task is posed as a question-answering problem while ours is formalized as an NLI-type problem. Almquist and Jatowt [6] also worked on the validity of sentences. Unlike their work, we use premises as the additional source instead of the information on the time elapsed since sentence creation as in [6], since, in many practical situations, additional text is available (e.g., sequences of tweets posted by the same user, or subsequent sentences in a story or novel). Another recently proposed task, called Temporal Validity Change Prediction (TVCP) [37], is similar to TVR; however, it requires inferring whether the temporal validity has increased, decreased or rather remained at the same level given additional context. Figure 1 shows the comparison of our task with the one proposed in [6]. Table 4 also compares the TVR task with the most related ones.

2.5 Similar datasets

Table 3 Datasets summary

There are several temporal commonsense reasoning datasets available. In the following, we list the main ones and summarize them in Table 3:

MC-TACO [34]: Given a context, a question, and a candidate response, the objective is to determine whether the candidate answer is “yes” (plausible) or “no” (implausible). The dataset focuses on assessing the plausibility of the answer within the temporal context provided.

TimeDial [38]: Dataset of a multiple-choice cloze task featuring over 1.1K carefully curated dialogues. The dialogues require an understanding of temporal commonsense concepts interwoven with the presented events.

WikiHow [70]: Given a goal and a number of steps, a system has to determine if the steps are in the correct temporal order.

BIG-bench [71]: Provided with a sequence of finished events, each with its defined timeframe, the model needs to determine when an individual might have been available for an unscheduled activity. While both BIG-bench and WikiHow encompass various other reasoning tasks, we specifically focus in this work only on temporal reasoning.

TimeQA [72]: This dataset comprises a series of time-sensitive question-answer pairs. Answering these questions involves understanding and reasoning within a longer context that requires temporal comprehension.

Note that in the case of TVR, the model has to ascertain the validity of textual content by using additional associated content as corroborating evidence. Our dataset is thus complementary to the above-listed ones.

We also note that the above-listed datasets actually cover most of the temporal commonsense reasoning styles according to the categorization proposed by Zhou et al. [34]:

Event duration (ED): reasoning about event durations.

Event ordering (EO): reasoning about the typical sequence of events.

Frequency (F): reasoning about the frequency of event occurrences.

Stationarity (S): reasoning about the length of state persistence.

Typical time (TT): reasoning about the specific timing of events.

Table 3, besides summarizing the datasets, also provides information on the types of temporal commonsense reasoning they involve (cf. the last column), the characteristics of their tasks, the format of the output, and the evaluation metrics applied. The TVR task requires the combination of information related to the Stationarity, Event Duration, and Event Ordering types. It also directly considers the notion of changes in information validity and obsoleteness.

Table 4 Comparison of our work with the most related tasks

3 Proposed method

3.1 Task definition

We first provide the definition of our task. Let \(p = (s_1, s_2)\) be a pair of sentences where \(s_1\) and \(s_2\) are a hypothesis and a premise sentence, respectively. The sentences are in temporal order \(t_{s_1} \le t_{s_2}\), where \(t_{s_{id}}\) \((id=1,2)\) is the creation time or the reading time of sentence \(s_{id}\) (e.g., in the case of receiving microblog posts issued by a user, or when reading subsequent sentences of a story or a novel). The task is to assign one of the following three validity classes to \(s_1\) through inference on \(s_1\) based on the content of \(s_2\):

$$\begin{aligned} c \in \{\textsc {supported}, \textsc {invalidated}, \textsc {unknown}\} \end{aligned}$$
(1)

The semantics of the classes are as follows:

  • Supported: \(s_1\) remains valid at \(t_{s_2}\) given the information in \(s_2\).

  • Invalidated: \(s_1\) ceased to be valid at \(t_{s_2}\) in view of \(s_2\).

  • Unknown: the situational evidence is not conclusive or clear, and nothing can be said regarding the validity of the hypothesis (hence it can neither be supported nor invalidated).

As mentioned earlier, while we focus on the case of individual sentences, larger text chunks such as paragraphs could be considered instead. This, however, would pose more complexities since multiple actions with various inter-relations might be expressed in longer text portions.

We note that the assumption that \(s_1\) is followed by \(s_2\), whether represented by the order of their creation/posting dates (e.g., in the case of Twitter) or by the sentence order in text (e.g., a story or a novel), may not always hold in reality. Although such an order is the most natural and common, there may be cases of retrospection in narratives, or of swapped order of creating/sending messages. We leave the investigation of such cases for future work.

Finally, we note that, for simplicity, we do not consider in this paper hypothesis sentences expressing future or past actions but only ones that describe ongoing actions. The case of sentences about the past is rather trivial (e.g., "WWII started in 1939 with the attack on Poland") since they are in general valid. On the other hand, the case of sentences about the future, such as ones expressing plans, forecasts and expectations, is somewhat difficult to evaluate; hence, in the current experiments, we focus only on present actions and states.

3.2 Methodology

Given the requirement for temporal commonsense reasoning when judging validity, we believe that it makes sense to incorporate external knowledge about common human actions and their temporal aspects. Consequently, we first discuss relevant knowledge bases that could provide information about the temporal properties of common user actions. Then, we propose a new neural network-based architecture that combines two encoders. The first encoder utilizes information drawn from the knowledge base, while the second encoder, the text encoder, is based only on the text data. The combined output of the knowledge encoder and the text encoder is then used as input to a softmax classifier. Figure 2 illustrates the architecture of our model.

Fig. 2 Architecture of the proposed model

3.2.1 Encoding knowledge

As mentioned before, one of the components of our model is the knowledge encoder. It is, however, necessary to first select a suitable knowledge base for making it effective for our task. Below, we describe our choice of knowledge base and discuss how we encode its knowledge.

Several different knowledge bases (KBs) could be useful for achieving our goal. We have considered the following ones: FrameNet [75], WikiHow [76], HowTo100M [77], and VerbNet [78]. We concluded, however, that ATOMIC-2020 (An ATlas Of MachIne Commonsense) [7] would be the most suitable KB for our purpose thanks to its relatively large scale (1.33M commonsense knowledge tuples and 23 commonsense relations) and because it contains temporal commonsense relations.

ATOMIC [79] is the predecessor of ATOMIC-2020, a KB designed for commonsense reasoning which contains nine different if-then relations such as Cause, Effect, Intention, Reaction, and so on. Most of the entities in ATOMIC are expressed as short texts or phrases. ATOMIC-2020 is the subsequent version of ATOMIC, which incorporates new relations between events such as “IsAfter”, “IsBefore”, “HasSubEvent”, etc. For example, “PersonX pays PersonY a compliment” and “PersonX will want to chat with PersonY” are sentences belonging to an if-then relation in ATOMIC-2020, while “PersonX bakes bread” and “PersonX needed to buy ingredients” is an example of a pair of sentences connected by the “IsAfter” relation.

3.3 TransE

We adapt the TransE [80] model to represent commonsense relations between events. In the following, we briefly explain the idea behind this adaptation, starting with an explanation of TransE itself. TransE is a model for learning embeddings of KBs represented as triples of entities and their relations of the form [head entity, relation, tail entity].

In TransE, relations are treated as translations in the vector space. TransE learns embeddings using a margin-based loss over entities and their relations (similar in spirit to skip-gram [81]) so that the relationship head entity + relation ≈ tail entity is preserved:

$$\begin{aligned} \mathcal {L} = \sum \limits _{(\textbf{h}, \textbf{l}, \textbf{t}) \in S} \sum _{(\mathbf {h'}, \textbf{l}, \mathbf {t'}) \in S'_{(\textbf{h}, \textbf{l}, \textbf{t})}} [\gamma + d(\textbf{h} + \textbf{l}, \textbf{t}) - d(\mathbf {h'} + \textbf{l}, \mathbf {t'})]_+, \end{aligned}$$
(2)

where \([x]_+\) denotes the positive part of x, \(\gamma \) is a margin parameter, and d is the distance function. \(\textbf{h}\), \(\textbf{l}\), and \(\textbf{t}\) are the embeddings of the head entity, relation label, and tail entity, respectively. S is the set of positive data instances, while \(S'\) is the set of negative (corrupted) ones.
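For illustration, the following is a minimal PyTorch sketch of this margin-based ranking objective; the embedding tables, margin value, and L2 distance are illustrative choices rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """Minimal TransE: entities and relations share one embedding space."""
    def __init__(self, n_entities, n_relations, dim=256, margin=1.0):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        self.margin = margin
        nn.init.xavier_uniform_(self.ent.weight)
        nn.init.xavier_uniform_(self.rel.weight)

    def distance(self, h, l, t):
        # d(h + l, t): here an L2 distance, as in Eq. (2)
        return torch.norm(self.ent(h) + self.rel(l) - self.ent(t), p=2, dim=-1)

    def forward(self, pos, neg):
        # pos / neg: (batch, 3) index triples [head, relation, tail];
        # negatives are corrupted copies of the positive triples
        d_pos = self.distance(pos[:, 0], pos[:, 1], pos[:, 2])
        d_neg = self.distance(neg[:, 0], neg[:, 1], neg[:, 2])
        # [gamma + d(h + l, t) - d(h' + l, t')]_+  averaged over the batch
        return torch.clamp(self.margin + d_pos - d_neg, min=0.0).mean()
```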

3.4 Adapting TransE for ATOMIC-2020 sentences

As mentioned before, the entities in the ATOMIC-2020 dataset are represented as short phrases or sentences. In the following, we show two examples of relations together with their head and tail entities (Footnote 4):

  • x get x’s car repaired, happens before, person spent a fortune

  • x runs out of steam, is after, x exercises in the gym

In the case of the original TransE method, the same entities are assumed to occur multiple times in the knowledge base. In the case of ATOMIC-2020 entities, on the other hand, the number of potential entities is very large, as there can be many diverse phrases representing arbitrary human actions. Also, it is rather rare for a phrase appearing at inference (or test) time to be identical to one used for training.

To solve this problem, we adapt the TransE model to work on text chunks instead of discrete entities as originally used. Figure 3 shows the detailed structure of our model. First, we compute a sentence vector for each phrase in the KG using Sentence-BERT (SBERT) [82]. Then, we train the weights W for the sentence vectors and the relation embeddings \(E_r\) using a margin-based ranking loss as in TransE. The weights of SBERT are fixed and not trained. Since our task relies on temporal commonsense reasoning, we select only the “IsAfter” and “IsBefore” relations from ATOMIC-2020 for our model.

After the pre-training with TransE is completed, we construct an encoder for the downstream task using the embeddings of the TransE model pre-trained on the ATOMIC-2020 knowledge base. In this encoder, the output is the concatenation of the embeddings of the hypothesis and of the premise sentence.
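A minimal sketch of this sentence-level adaptation (cf. Fig. 3) is shown below, assuming the sentence-transformers package; the SBERT checkpoint name, embedding dimension, and negative-sampling interface are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class SentenceTransE(nn.Module):
    """TransE over ATOMIC-2020 phrases: frozen SBERT vectors projected by a
    trainable matrix W, plus trainable embeddings E_r for the selected
    relations (here IsAfter and IsBefore)."""
    def __init__(self, relations=("IsAfter", "IsBefore"), dim=256, margin=1.0,
                 sbert_name="all-MiniLM-L6-v2"):  # illustrative SBERT checkpoint
        super().__init__()
        self.sbert = SentenceTransformer(sbert_name)
        for p in self.sbert.parameters():          # SBERT weights stay fixed
            p.requires_grad = False
        sbert_dim = self.sbert.get_sentence_embedding_dimension()
        self.W = nn.Linear(sbert_dim, dim, bias=False)   # trainable projection W
        self.rel = nn.Embedding(len(relations), dim)     # relation embeddings E_r
        self.rel_idx = {r: i for i, r in enumerate(relations)}
        self.loss = nn.MarginRankingLoss(margin=margin)

    def encode(self, phrases):
        with torch.no_grad():
            v = self.sbert.encode(phrases, convert_to_tensor=True)
        return self.W(v)

    def forward(self, heads, rels, tails, neg_tails):
        h, t, t_neg = self.encode(heads), self.encode(tails), self.encode(neg_tails)
        r = self.rel(torch.tensor([self.rel_idx[x] for x in rels]))
        d_pos = torch.norm(h + r - t, p=2, dim=-1)
        d_neg = torch.norm(h + r - t_neg, p=2, dim=-1)
        # rank positive triples below corrupted ones by at least the margin
        return self.loss(d_neg, d_pos, torch.ones_like(d_pos))
```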

Fig. 3 TransE model for sentences

3.5 Other knowledge embedding models

While TransE is probably the most commonly used translation-based embedding model, several newer variants are also available. In our experiments, we therefore also test other knowledge embedding models, namely TransH [83] and ComplEx [84], in place of TransE. Both models can be regarded as successors of TransE.

TransH extends TransE by applying the translation from the head to the tail entity in a relation-specific hyperplane. This is done to address the inability of TransE to model one-to-many, many-to-one, and many-to-many relations. When applying TransH, we use the same knowledge embedding model as for TransE except for an additional module applied for the projection. In order to project entities onto each relation’s hyperplane, we use a relation-specific projection, as in the original TransH model.
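For reference, the original TransH formulation projects the head and tail embeddings onto the relation-specific hyperplane with unit normal \(\textbf{w}_r\) before applying the translation \(\textbf{d}_r\) (in our adaptation, \(\textbf{h}\) and \(\textbf{t}\) correspond to the projected sentence vectors):

$$\begin{aligned} \textbf{h}_\perp = \textbf{h} - (\textbf{w}_r^\top \textbf{h})\,\textbf{w}_r, \quad \textbf{t}_\perp = \textbf{t} - (\textbf{w}_r^\top \textbf{t})\,\textbf{w}_r, \quad f_r(\textbf{h}, \textbf{t}) = \Vert \textbf{h}_\perp + \textbf{d}_r - \textbf{t}_\perp \Vert _2^2. \end{aligned}$$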

The ComplEx model is based on complex-valued entity and relation representations. When applying ComplEx, we add a linear layer after the sentence embedding so that the model has two parallel linear layers transforming the sentence embeddings, where one represents the real part and the other the imaginary part.
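For completeness, ComplEx scores a triple with complex-valued embeddings \(\textbf{h}, \textbf{r}, \textbf{t} \in \mathbb {C}^d\) as follows, where \(\bar{\textbf{t}}\) denotes the complex conjugate of \(\textbf{t}\); in our setup, the real and imaginary parts come from the two parallel linear layers described above:

$$\begin{aligned} f_r(\textbf{h}, \textbf{t}) = \mathrm {Re}\left( \sum \limits _{k=1}^{d} \textbf{h}_k\, \textbf{r}_k\, \bar{\textbf{t}}_k \right) . \end{aligned}$$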

3.5.1 Combined model

As shown in Fig. 2, our final model consists of the combination of the text encoder and the knowledge encoder, together with a classification layer on top of them. As the dimensions of the pre-trained knowledge embeddings and of the output of the text encoder differ, we linearly transform them so that the embedding sizes are equal. We then apply concatenation, subtraction, and the element-wise product to combine both embedding vectors:

$$\begin{aligned} \textbf{H} = \mathrm {Linear}(\mathbf {H_t};\, \mathbf {H_k};\, \mathbf {H_t}-\mathbf {H_k};\, \mathbf {H_t}\odot \mathbf {H_k}), \end{aligned}$$
(3)

where \(\mathbf {H_t}\) is the output of the text encoder, \(\mathbf {H_k}\) is the output of the knowledge encoder, and \(\odot \) denotes element-wise multiplication. Finally, the obtained output is linearly transformed and fed into a softmax classifier which decides the validity class.
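A minimal PyTorch sketch of this combination and classification step is given below; the layer dimensions and the ReLU non-linearity are illustrative assumptions, not the exact configuration of our implementation.

```python
import torch
import torch.nn as nn

class CombinedClassifier(nn.Module):
    """Combines text-encoder and knowledge-encoder outputs as in Eq. (3)."""
    def __init__(self, text_dim=768, know_dim=256, hidden=128, n_classes=3):
        super().__init__()
        # make both representations the same size before combining
        self.proj_t = nn.Linear(text_dim, hidden)
        self.proj_k = nn.Linear(know_dim, hidden)
        # [H_t ; H_k ; H_t - H_k ; H_t * H_k] -> linear -> softmax over 3 classes
        self.combine = nn.Linear(4 * hidden, hidden)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, h_text, h_know):
        ht, hk = self.proj_t(h_text), self.proj_k(h_know)
        h = torch.cat([ht, hk, ht - hk, ht * hk], dim=-1)
        logits = self.out(torch.relu(self.combine(h)))
        # log-probabilities over {supported, invalidated, unknown}
        return torch.log_softmax(logits, dim=-1)
```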

3.5.2 Knowledge-encoder only

For comparison, in our experiments we also test a knowledge-encoder-only version of the model. In this variant, we exclude the text encoder and the comparison layer from the combined version. In this case, the linear layer used for dimension adjustment is not strictly necessary; however, we keep it for the purpose of comparison.

4 Dataset

We construct a new dataset in order to evaluate our method designed for the proposed task. As mentioned earlier, each data instance is composed of a pair of a hypothesis and a premise sentence together with a label denoting the validity of the hypothesis. This formulation bears a strong resemblance to the Natural Language Inference problem in NLP, also known as Textual Entailment recognition.

To begin with, we need seed sentences for which we can create corresponding sentences that fall into one of the three validity classes. We decided to randomly select 5000 premise sentences from the SNLI dataset (Footnote 5) and use them as the hypothesis sentences. The sentences in SNLI were originally caption sentences of images from the Flickr30k corpus [85].

Since we wanted a topically balanced dataset, we first clustered all the collected sentences using BERT [55] for sentence representation and k-means (with \(k = 100\)) as the clustering algorithm. Then, we sampled an equal number of sentences from each cluster (in particular, we randomly extracted up to 50 sentences per cluster) to maintain a high variation of sentence meaning.
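The following sketch illustrates this topical balancing step; the embedding model name is an illustrative stand-in (the actual pipeline uses BERT representations), and the helper function name is ours.

```python
import random
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def topically_balanced_sample(sentences, k=100, per_cluster=50, seed=0):
    """Cluster candidate sentences and sample up to `per_cluster` from each
    cluster (k = 100 with 50 per cluster yields roughly 5000 hypotheses)."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model name
    vectors = encoder.encode(sentences)
    labels = KMeans(n_clusters=k, random_state=seed).fit_predict(vectors)

    rng = random.Random(seed)
    sampled = []
    for c in range(k):
        members = [s for s, lab in zip(sentences, labels) if lab == c]
        rng.shuffle(members)
        sampled.extend(members[:per_cluster])
    return sampled
```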

Having the hypothesis sentences, we next created their premise sentences as well as class labels using crowdsourcing. We used Amazon Mechanical Turk (Footnote 6) as the crowdsourcing platform. In particular, for each hypothesis, we asked two crowdworkers to create a sentence corresponding to each class label. In order to prevent copying, a candidate premise sentence was accepted only when 40% or more of its words were not in the corresponding hypothesis sentence. This was checked by tokenizing the sentences and computing the word overlap between the hypothesis and the premise. The similarity checking step was necessary as some crowdworkers tried to create sentences with minimal effort (e.g., by modifying a single word or a few words). Such trivial sentences were especially likely to be produced for the supported class.
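A minimal sketch of this anti-copying filter is shown below; the tokenizer (a simple lowercase word split) and the function name are illustrative assumptions.

```python
import re

def passes_overlap_check(hypothesis: str, premise: str,
                         min_novel_ratio: float = 0.4) -> bool:
    """Accept a crowdsourced premise only if at least 40% of its words
    do not appear in the corresponding hypothesis."""
    hyp_tokens = set(re.findall(r"\w+", hypothesis.lower()))
    prem_tokens = re.findall(r"\w+", premise.lower())
    if not prem_tokens:
        return False
    novel = sum(1 for w in prem_tokens if w not in hyp_tokens)
    return novel / len(prem_tokens) >= min_novel_ratio

# Example: a near-copy of the hypothesis is rejected
print(passes_overlap_check(
    "Now waiting for a train to Munich, should be arriving here soon.",
    "Still waiting for the train to Munich."))  # -> False
```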

While creating premise sentences, we also asked participants to provide a description of the estimated time during which the hypothesis sentences could have actually remained valid. The reason for this was to make sure that the workers carefully considered the temporality of the text when creating content. This information was, however, not included in the final dataset. In total, about 400 workers participated in the dataset creation. As we found some noisy sentences, we later went through all the sentences in the dataset to verify their quality. Whenever necessary, grammar was manually corrected, while some poor-quality, noisy, offensive, or overly personal sentences were removed.

In total, we removed 19,341 pairs of sentences. Our verification was done by checking whether the written sentence contains anything offensive or personal, whether the sentence copies the corresponding hypothesis, whether the sentence matches the corresponding hypothesis, whether the sentence contains any grammatical or spelling errors, and so on. The final dataset includes 10,659 sentence pairs. The number of sentence pairs per class is the same (3,553 pairs each). Table 1 shows a few examples of the generated data. For comparison, we also present example sentences of the NLI task in Table 2.

In Table 5, we show the average number of words across the different classes. Previous research has indicated that the number of words in sentences in some NLI datasets varies significantly depending on their labels [52], which may adversely impact classifiers. As can be seen, the average number of words in our dataset is almost the same across labels, although the variance is a bit higher for the unknown class. It is also interesting that premise sentences are on average shorter than the hypothesis sentences.

Table 5 Average sentence length in our dataset expressed in number of words

5 Experiments

5.1 Experimental settings

In our experiments, all compared models undergo 5-fold non-nested cross-validation. The batch size is 16, and the learning rate is selected from among 0.005, 0.0005, and 0.00005 based on the performance on the validation fold. The optimal value for all models was 0.00005. Accuracy, the proportion of correct responses, is the most relevant and extensively used metric for NLI tasks and is used to evaluate our approach, too.

Besides ours, we also test the following models:

  • BERT (bert-base-uncased) [55],

  • Siamese Network [86],

  • SBERT [82] Embeddings with Feedforward Network,

  • Self-Explaining Model [87].

Additionally, we also include the results of GPT-3.5 and Llama 7B [88] in a few-shot setting (three examples), as originally reported in [40].

We set up an architecture of the Siamese Network similar to the one used in the work of Bowman et al. [48], which utilizes the 8B version of GloVe embeddings [89] and multiple tanh layers. The number of trainable parameters is 240,181,403.

SBERT is a fine-tuned variant of BERT, RoBERTa and related models designed to calculate the semantic similarity between two sentences. While BERT has demonstrated state-of-the-art performance on many downstream tasks, it was unclear how to obtain good sentence representations from it. In addition, calculating similarity as a two-input regression task with BERT requires a long time. SBERT was proposed to solve these problems. In our experiments, we use 3 hidden layers, each with 500 dimensions, ReLU activations, and a dropout rate of 75%. The output layer is a softmax classifier. The number of trainable parameters for this model is 1,271,003.
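This baseline can be sketched as follows; the input dimensionality (concatenated SBERT embeddings of the hypothesis and premise) is an illustrative assumption.

```python
import torch.nn as nn

class SbertFeedforward(nn.Module):
    """SBERT-embeddings + feedforward baseline: three hidden layers of 500 units
    with ReLU and 75% dropout, followed by a classifier over the three TVR labels."""
    def __init__(self, in_dim=2 * 384, hidden=500, n_classes=3, p_drop=0.75):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(3):
            layers += [nn.Linear(d, hidden), nn.ReLU(), nn.Dropout(p_drop)]
            d = hidden
        layers += [nn.Linear(d, n_classes)]   # logits; softmax applied in the loss
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```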

The last tested approach, the Self-Explaining model [87], is a state-of-the-art model for the SNLI dataset and has 127,008,769 trainable parameters. The Self-Explaining model is equipped with an attention-like Self-Explaining module composed of three layers: a Span Infor Collecting (SIC) layer, an Interpretation layer, and an output layer. The Self-Explaining module is placed on top of a text encoder, which is RoBERTa-base [56], the same as in the original version of the model. The Robustly optimized BERT approach (RoBERTa) is a BERT-based model using modified pre-training data and optimized hyperparameters. RoBERTa is known to achieve significant improvements in performance compared to BERT [90] on downstream tasks.

The SIC layer outputs an embedding for each word span. Suppose that a sentence consists of words \(w_1,..., w_K\). A word span is a sequence of words \(w_i, w_{i+1},..., w_j\) within an interval \([i, j]\) \((1 \le i < j \le K)\), and we denote its embedding by \(\mathbf {h_{ij}}\). The encoder outputs the word embeddings \(\mathbf {e_1},..., \mathbf {e_K}\), and using these as input, the span embedding is computed as \(\mathbf {h_{ij}}=F(\mathbf {e_i}, \mathbf {e_j})\), where F is an arbitrary function. The final output of the SIC layer is the concatenation of all span embeddings, \(\textbf{H} = (\mathbf {h_{1,2}}; \mathbf {h_{1,3}}; \ldots ; \mathbf {h_{K-1,K}})\), where “;” denotes concatenation. The Interpretation layer aggregates the output of the SIC layer by first calculating the importance of each span embedding:

$$\begin{aligned} o(i, j)&= \varvec{\tilde{h}}^\top \, \textbf{h}(i, j), \\ \alpha (i, j)&= \frac{\exp (o(i, j))}{\sum \limits _{i \in [1, K],\, j \in [i, K]}\exp (o(i,j))}. \end{aligned}$$

The weighted average of these values forms the output of this layer:

\(\varvec{\hat{h}}= \sum _{i \in [1,K],\, j \in [i,K]}\alpha (i, j)\,\textbf{h}(i, j).\) The final output of the Self-Explaining model is the output of the softmax classifier, and the loss for training the model is defined as \(L=-\log p(y|\textbf{x}) + \lambda \sum \limits _{i,j}\alpha ^2(i, j).\)
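A simplified PyTorch sketch of these layers is shown below; the choice of F as a linear layer over the concatenated endpoint embeddings and the tensor shapes are illustrative assumptions rather than the exact original implementation.

```python
import torch
import torch.nn as nn

class SelfExplainingHead(nn.Module):
    """Sketch of the Self-Explaining layers: span embeddings h(i, j) from the
    encoder outputs, attention weights alpha(i, j), and their weighted sum
    h_hat fed to a classifier. F is instantiated here as a linear layer over
    the concatenated endpoint embeddings (one common choice)."""
    def __init__(self, dim=768, n_classes=3):
        super().__init__()
        self.span_f = nn.Linear(2 * dim, dim)           # F(e_i, e_j)
        self.h_tilde = nn.Parameter(torch.randn(dim))   # scoring vector h~
        self.out = nn.Linear(dim, n_classes)

    def forward(self, e):                               # e: (K, dim) word embeddings
        K, _ = e.shape
        idx_i, idx_j = torch.triu_indices(K, K, offset=1)
        spans = self.span_f(torch.cat([e[idx_i], e[idx_j]], dim=-1))  # h(i, j)
        o = spans @ self.h_tilde                        # o(i, j)
        alpha = torch.softmax(o, dim=0)                 # alpha(i, j)
        h_hat = (alpha.unsqueeze(-1) * spans).sum(dim=0)
        return self.out(h_hat), alpha                   # logits and span weights
```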

For testing our proposed architecture, we experiment with two types of contextual encoders: the Siamese Network and the Self-Explaining model. The dimensionality of the entity embeddings was set to 256, but the combined embeddings are linearly transformed to 128 dimensions in order to match the dimensionality of each encoder.

We implemented all the models using PyTorch [91] with HuggingFace’s transformers [92] on a machine equipped with a GPU.

5.2 Experiments with NLI pre-training

We first experiment with NLI datasets, given the similarity between NLI and our task as well as the availability of large-scale datasets for Natural Language Inference. It has been established that pre-training with NLI datasets improves accuracy on many downstream tasks [93]. Given the similarity and relevance of NLI to our task, we decided to first analyze whether pre-training selected models on NLI datasets and then fine-tuning them on the proposed task could help boost the accuracy of the results. For this experiment, we used the training sets of the SNLI 1.0 and MNLI 0.9 datasets, including 550,152 and 392,702 instances, respectively. We also needed to establish a mapping between the NLI classes and the three classes of our task. We decided on the following correspondence:

  • Supported = entailment class of NLI,

  • Invalidated = contradiction class of NLI,

  • Unknown = neutral class of NLI.

We believe that this class mapping is the most suitable based on the semantics of each class. Table 6 presents the results obtained with NLI pre-training. The results indicate that the NLI data has certain relevance and usefulness for our task, as it improves the results for the Siamese network. However, the results for the Self-Explaining model were not improved, likely because this model is already a pre-trained one. Interestingly, quite a large drop in accuracy occurs when using the MNLI dataset, although the actual reason for this remains unclear. Nevertheless, the results indicate that NLI datasets contain information that can be to some degree useful for our task, especially for relatively simpler models like Siamese Nets.

Table 6 NLI pre-training results in Siamese Network and Self-Explaining Model

5.3 Main results

We show the main results in Table 7. According to the results, we observe that the Self-Explaining model achieves the best accuracy. BERT provides the worst results, likely because of the unstable training of BERT, as also pointed out in [55]. The TVR task also seems challenging for large language models; the LLMs we use do not perform well on this task, even in the few-shot setting.

Another notable point is that incorporating commonsense knowledge from an external knowledge base improves the accuracy of both the Siamese and the Self-Explaining models. However, the improvement is smallest for the Self-Explaining model with RoBERTa as the pre-trained module; RoBERTa has been trained on a larger amount of data than BERT and was subject to careful optimization.

Nevertheless, we observe that the improvement when incorporating our solution based on TransE is relatively good. The results rise from 0.715 to 0.784 (9.6% improvement) for the Siamese Net and from 0.805 to 0.819 (1.7% improvement) for the Self-Explaining model. In general, we conclude that adding commonsense data in the proposed way is a promising direction for the proposed task. We believe that exploration of more sophisticated commonsense reasoning approaches and more comprehensive datasets could further improve the results. We later explore a dataset extension to see if we could further boost the accuracy in simple and inexpensive ways.

Our conclusions are also supported by the confusion matrices for the Siamese model shown in Fig. 4. Incorporating TransE results in more correct determination of the supported and unknown classes (improvements of 28% and 6.5%, respectively). On the other hand, it only slightly confuses the invalidated class (a decrease of 1.4%).

Table 7 Results on TVR task

In Table 8, we also show examples of data instances misclassified by our approach.

Table 8 Wrong predictions generated by our approach
Fig. 4 Confusion matrices for the TVR task of the Siamese network (left) and the Siamese network with TransE (right). The horizontal axis corresponds to the prediction (\(x^P\)) and the vertical one to the gold labels (\(x^G\)). The left (upper) blocks are invalidated, the middle ones are supported, and the right (bottom) ones are unknown

5.4 Testing different knowledge types

We next investigate other ATOMIC-2020 relations to determine whether they would be applicable to our task. ATOMIC-2020 incorporates three kinds of fundamental relations: physical-entity, social-interaction, and event-centered. We utilize all event-centered relations, including isAfter and isBefore, which have already been used, as well as the newly added relations HasSubEvent, HinderedBy, Causes, xReason, and isFilledBy. Since both our objective and our dataset require reasoning about the relationships between two events, we regard this subset of ATOMIC-2020 data as relevant and valuable for our particular task.

Table 9 shows the results of TransE trained with the aforementioned data using the Self-Explaining and Siamese models. When comparing with the previous results, we observe that the incorporation of additional knowledge into the Siamese Network enhances the accuracy on our task, improving the results by 1.8% (a change from 0.784 to 0.798). On the other hand, the Self-Explaining model’s accuracy decreases by 4.6% (from 0.819 to 0.783). This again suggests that the performance of simpler models which are not pre-trained can be enhanced by incorporating more data, while for more complex methods additional relations about events are likely to confuse the model.

Table 9 Results of TransE trained by event-centered relations in ATOMIC-2020 on TVR task

5.5 Augmenting dataset

We have also tried to automatically augment our dataset, as it contains a relatively small number of data instances. Data augmentation is a machine learning technique that adds new instances to a dataset at hand; these instances are typically generated automatically by synonym replacement, back-translation, text shortening or extension, or other approaches. We adopted back-translation in this study. In particular, we used facebook/wmt19-en-de and facebook/wmt19-de-en [94] as translation models, which means that the sentences were translated from English to German and then back from German to English. We then added the newly generated data to the training set, resulting in the incorporation of over 12k new sentence pairs.
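The back-translation step can be sketched as follows with the two WMT19 models mentioned above; the generation settings are illustrative.

```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

def load(name):
    return FSMTTokenizer.from_pretrained(name), FSMTForConditionalGeneration.from_pretrained(name)

tok_en_de, en_de = load("facebook/wmt19-en-de")
tok_de_en, de_en = load("facebook/wmt19-de-en")

def translate(text, tok, model):
    ids = tok.encode(text, return_tensors="pt")
    out = model.generate(ids, num_beams=5, max_length=128)
    return tok.decode(out[0], skip_special_tokens=True)

def back_translate(sentence):
    # EN -> DE -> EN paraphrase used as an additional training instance
    return translate(translate(sentence, tok_en_de, en_de), tok_de_en, de_en)

print(back_translate("Now waiting for a train to Munich, should be arriving here soon."))
```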

Table 10 shows the results of the experiments when using the augmented dataset. In total, we used 18k sentences for training (the combined original training data and the new training data). According to the table, we observe that for the Siamese Network the accuracy is slightly better than the original result (0.715 in Table 7); however, that of the Self-Explaining model is slightly worse than before (0.873 in Table 7). Overall, the results do not show a large improvement in accuracy and sometimes even drop. This is likely because the augmented data are still too similar to the original dataset, so the models cannot learn much from them. We believe that more manually annotated data could further help the performance, or other data augmentation techniques could be experimented with. We will address this in our future work.

Table 10 Results of Siamese Network and the Self-Explaining model on TVR task with data augmentation

5.6 Combination methods of knowledge embedding

We next experiment with different methods of combining the signals obtained from processing the premise and hypothesis sentences. We consider simple concatenation as the baseline method to produce the combined embedding of hypothesis and premise. This keeps both the embedding model and the upper layers rather simple; since the upper layers are relatively simple, it might be difficult for them to effectively process a complex embedding. Hence, we explore other combination variants, such as subtraction and the combination of concatenation, subtraction, and element-wise multiplication, in the knowledge-encoder-only setting. In this experiment, the concatenation method is considered the baseline.

Table 11 shows the results of our exploration of the different ways to produce the knowledge embedding when using the proposed model. The results show that as the amount of information increases, the performance improves. Hence, we also test the version with the Self-Explaining model equipped with concatenation, subtraction, and multiplication. Table 12 shows this result. Unlike for the proposed model, the accuracy of the combined concatenation, subtraction, and multiplication is, however, worse than that of concatenation alone. Thus, more elaborate ways of combining the output of the Self-Explaining model with the knowledge embedding may be too complex to process for the upper layers of the Self-Explaining model.

Table 11 Results of the Knowledge-Encoder-only model with variants of the way to combine the embeddings
Table 12 Results of the Self-Explaining model combined with variants of the way to combine the knowledge embedding

5.7 Testing different knowledge embedding approaches

Finally, we explore different translating embedding models that could be used as substitutes for TransE. Table 13 shows the results of the TransE variants combined with the Self-Explaining model. According to the table, the loss during pre-training does not decrease for TransH [83] and ComplEx [84]. As the loss remains high, the accuracy on the proposed downstream task is lower, indicating that the proposed architecture requires simpler ways of constructing the knowledge-based embeddings for TVR. TransH and ComplEx are more complex models than TransE and, as the results show, they negatively impact the accuracy. Another possibility is that the models we tested were not supplied with sufficiently large knowledge to properly benefit from their more complex architectures.

Table 13 Results of TransE variants used with the Self-Explaining model on the TVR task

6 Applications

This section discusses some applications of the proposed task, and the proposed model.

Support for story understanding and event extraction: Effective methods trained for the proposed task can lead to better comprehension of textual narratives and, potentially, more accurate event extraction [95]. For example, a component that can reason about action completion given the evidence provided in subsequent sentences would enhance the reading comprehension of stories. We observe that such knowledge is frequently stated in an implicit way.

Recommendation of microblog posts: Microblog entries can remain valid for different lengths of time. In the era of information overload, users would generally wish to select and read only valid messages from a large number of postings, as such messages are typically the most relevant and significant. Information overload could be reduced if users had the option to filter out invalid posts from their timelines, or if posts were ranked based on a number of factors, including their anticipated temporal validity.

User tracking and analysis: User tracking and analysis in social networking services (SNSs) [8, 96, 97] can be supported by temporal processing of users’ posts so that a user’s current situation and action are known at each point in time. Temporally targeted ads then become possible. Additionally, in emergency situations it would be easier to confirm a user’s whereabouts and safety.

Chatbots: Finally, chatbots and AI assistants [98] could be equipped with means for better understanding user contexts, plans, required actions and stories told by users, to maintain better communication with humans.

7 Conclusion and future work

There are still numerous challenges in the computational processing of implicit temporal information in natural language. In this work, we propose a novel task of reasoning about the temporal validity of sentences based on follow-up additional context, and we also design and train a new model with an embedded knowledge base for this task. Inspired by the observation that humans can evaluate the validity of text based on their temporal common sense, we aim at equipping machines with the same ability.

We formally define the task and construct a dedicated dataset for it, along with selecting a suitable knowledge base and proposing a new method for embedding sentences within knowledge bases using machine learning.

We believe that our work can contribute to a better understanding of how to rely on knowledge bases containing sentences as entities and enhance the performance on the TVR task or other types of temporal commonsense reasoning. We have also conducted experiments with NLI data as well as with a data augmentation technique to determine whether they can be utilized for the proposed task.

In future work, we plan to extend the dataset both through more extensive manual annotation and by applying more effective data augmentation approaches. The latter can be achieved by using Large Language Models for paraphrasing the existing sentences as well as by prompting the models to create more instances of similar sentences. Naturally, such data augmentation needs to be carried out together with strict manual checks, as in the case of crowdsourced data creation. The new dataset should not only be in the form of sentence pairs but should also contain larger chunks of text such as entire paragraphs. Another future direction is to extend the task itself. The timestamps of the premise and hypothesis sentences could be used as an additional signal to identify the validity of a hypothesis sentence [6], in addition to judgments based on the content of premise sentences. This would lead to a more general task and more potential applications.