1 Introduction

The extraction of temporal information from text is fundamental for language understanding [12] and an important sub-task for several language processing applications [45], such as text summarisation and knowledge base population. Processing a temporal expression (timex) from text, i.e. extracting and modelling the expression, includes tasks such as recognition and representation of the temporal information [26]. Solving challenging computational problems involving time has been a critical component in the development of information extraction (IE) systems [4], e.g. understanding how elements that describe temporal concepts can be formally represented, and what procedures an algorithm should perform to handle the set of operations that we as humans seem to carry out relatively easily [14].

In many situations, however, extracted temporal expressions are not accurately described in the text, i.e. the expressions denote an imprecise amount or point in time, as in “about 3 months ago”, “less than a year”, “few days”, and “recently”. More than 30% of temporal information in some text types, e.g. clinical notes, can be imprecise, affecting, for example, the results of searches for events related to such temporal data. In addition, an inaccurate interpretation may yield different values for the same expression. For this reason, for a given application, it is important to estimate standardised values for the existing imprecise timexes, i.e. to normalise them.

TimeML [34] is the major initiative for temporal information annotation and has been an ISO standard since 2010. It is designed to connect the processes of temporal analysis of a text with a representation and formal meaning of time, providing a model and annotation scheme for temporal information in text, including the TIMEX3 scheme for representing temporal expressions. Although TimeML is capable of describing imprecise timexes in terms of language structure, it does not provide mechanisms to normalise them correctly. As a result, the normalisation of imprecise temporal data in terms of values can be ambiguous or incomplete; for instance, TimeML provides a single mod attribute that allows the modification of expressions, but only in a very constrained way (12 preset non-disjoint modifiers). To overcome this limitation, existing approaches [20, 32, 38, 39] use fuzzy sets to represent individual timexes and relations. However, they describe specific historical events or generic periods of time (e.g. holidays), relying on external sources of data such as the results of Internet search queries or image timestamps collected from social media, and they do not provide a generic or reusable methodology for the normalisation of imprecise timexes. In these situations, the normalisation is based on the extracted time spans, is often focused on one kind of expression, and imposes a restricted interpretation of the timexes, making it difficult to apply to broader domains.

This paper contributes an analysis of a previously unstudied set of imprecise temporal expressions and presents a novel method for their normalisation and representation. The main contributions are the following:

Imprecise timex quantification and classification The classification was built from the expressions extracted from clinical narratives and is used as the basis for the presented approach.

Methodology for imprecise timex normalisation We introduce a novel methodology for the normalisation of imprecise temporal expressions extracted from text. Our methodology comprises a set of steps, starting with the creation of a set of questionnaires used to capture how people interpret vague descriptions of time in text. The questionnaires were designed from scratch, since there is no data set or standard for the evaluation of imprecise timexes. The answers were used as input data, from which we created histograms and fuzzy membership functions (MSF) during the pre-processing step. We then applied statistical regression and machine learning (ML) techniques in order to evaluate which would be the most suitable model for each kind of temporal imprecision being evaluated. The result is a grounded probability density function for the period to which the timex refers. We use the F1-score to calculate how similar two membership functions are, and to choose the most suitable representation model for each kind of imprecise temporal expression.

Weighted F1-score We present a new weighted F1-score variation, called \(\hbox {\textit{F}1}_{3\mathrm{D}}\), which better identifies the relevant differences between two membership functions in terms of confidence: when comparing two membership function shapes or two normalisation models, it indicates whether the differences are concentrated more towards the top or the bottom of the functions. We apply the presented methodology to three kinds of imprecise timexes, and we compare the resulting normalisation models in English and Portuguese. The results show that the normalisation models were able to capture the vagueness carried by the imprecise timexes.

This paper is organised as follows: Sect. 2 presents the background and related work regarding temporal information extraction and the normalisation of imprecise timexes; Sect. 3 quantifies imprecise expressions, comparing clinical and non-clinical domains, and proposes a classification for imprecise timexes; in Sect. 4, we propose a methodology for the normalisation of imprecise temporal expressions; Sect. 5 presents the normalisation models obtained for three types of imprecise expressions and compares the normalisation models for two different languages (Portuguese and English); lastly, in Sect. 6, we present the final conclusions and future work.

2 Background and related work

Time is a primary element that allows us to observe, describe, and reason about what surrounds us in the world, providing a substrate for the human management of perception and action. As a cognitive and linguistic component for describing changes which happen through the occurrence of events, processes, and actions, time provides a way to record, order, and measure the duration of such occurrences [4]. As time is a pervasive element of human life, failure to correctly identify the temporal ordering may result in poor comprehension, leading to misunderstanding [14].

2.1 Temporal information extraction

The general process of reading and understanding a text includes inferring whether the presented situations hold at particular points in time [14]. Organising events in chronological order is important to find the temporal relations (e.g. before/after relations) among them. Temporal information extraction plays an important role in this respect. Temporal expressions are written in natural language and can refer directly to time points or intervals (e.g. “6 years ago”), serving as anchors for linking concepts and events extracted from the text to a timeline, providing the correct distribution of such extracted elements in time [1]. Nevertheless, this seemingly easy task relies on a complex set of information involving different linguistic entities and sources of knowledge [14].

The recognition (or annotation) of temporal expressions (timexes) in text is the task of finding the corresponding labels \((y_1,\ldots , y_n)\) for a given input string of tokens \((x_1,\ldots , x_n)\) so that the resulting labelling can be decoded into textual spans that constitute the tokens and denote time in the input string [26]. According to Fagerberg [18], the temporal information extraction process comprises two steps: (a) temporal expressions have to be recognised within some kind of document and extracted from it; and (b) the extracted temporal expressions should be categorised and normalised to a canonical form—normalisation is not just a formatting problem, but a task in which the appropriate value of the extracted expression has to be calculated.

TimeML [34] became an ISO standard in 2010, as a language for temporal information annotation, designed to connect the processes of temporal analysis of a text with a representation and formal meaning of time. As a specification language for events and temporal expressions in natural language text, TimeML is able to capture distinct phenomena in temporal markup.

Temporal information extraction approaches are usually focused on recognising temporal expressions in text and normalising those expressions by using a function that transforms the matched expression into a normalised form based on TIMEX3 tags [7, 18]. Llorens et al. [30] argue that temporal expression normalisation can only be performed effectively with a large knowledge base and set of rules.

The TempEval series in SemEval (International Workshop on Semantic Evaluation) has been exploring the task of extracting temporal expressions, events, and temporal relations from text, with the purpose of advancing research on temporal information processing. SemEval-2015 Task 6 Clinical TempEval [6] and SemEval-2016 Task 12 Clinical TempEval [8] were temporal information extraction tasks over the clinical domain, using clinical notes and pathology reports for cancer patients. Results of TempEval-3 and Clinical TempEval (2015 and 2016) were given in terms of the precision, recall, and F1-score [17] relevance measures.

In addition to the SemEval TempEval series, the i2b2 Natural Language Processing Challenge for Clinical Records [43] focused on the temporal relations in clinical narratives, attracting 18 participating teams to analyse discharge summaries, annotating time expressions, events, and relations between them.

2.2 Normalisation of temporal expressions

Normalisation of temporal expressions (or Timex Normalisation) is the process of tagging a timex, by setting attribute values that describe that expression in terms of an amount of time or a point in time [27]. The timex normalisation task consists of obtaining the absolute value of a timex regardless of the linguistic expression used [30]. After a timex is recognised, its temporal value must be defined, which means finding the value attribute for such temporal expression. The normalisation process is usually implemented as a rule-based system to overcome some problems, including: (a) the infinite number of possible labels and (b) the large number of ways a calendar value can be expressed in natural language [26].

Current annotation standards are restricted to normalising imprecise timexes in terms of language structure or language elements [19, 36, 37, 42]. An expression like “few weeks” is normalised to represent an “undetermined period of time” or an “undetermined number of weeks”, making it hard to connect that expression to a timeline without any numerical value. If the normalisation guidelines are improved to describe a timex in terms of uncertain values or periods of time (e.g. ranges of values), events related to imprecise timexes can be placed chronologically, and temporal reasoning can be applied.

Although it is relatively easy to recognise temporal expressions using rule-based systems or supervised machine learning approaches, normalisation (interpreting them accurately) is a complex task that requires human knowledge, since any practical approach to timex normalisation requires a handcrafted rule set [30]. Kolomiyets [26] presents a TimeML-based normalisation technique that comprises three sub-tasks:

1. Timex classification: a classifier distinguishes between the four labels DATE, TIME, DURATION, and SET, as defined in TimeML, to determine the type of the time expression; a rule-based method then performs the semantic analysis of the time expression constituents (token labelling), identifying different categories (Table 1) with a comprehensive vocabulary and a set of context-dependent normalisation rules specific to each category.

2. Estimation of temporal values: temporal values are estimated (normalised); this is not considered a difficult task for absolute temporal expressions, because such timexes contain all the components required to calculate the final value. Relative expressions (“last week”, “next month”) can also be represented using the representation facilities of the ISO standards [23].

3. Aggregation of temporal values: an aggregation of temporal values is performed when one temporal expression consists of a set of shorter temporal expressions obtained by pre-normalisation; in this case, the partially estimated values are aggregated to obtain the final temporal value.

Table 1 Timex categories [26]

2.3 Imprecise temporal representation

Considerable effort has been devoted to extracting temporal information from natural language texts, allowing question answering systems to deal with more complex temporal questions. However, temporal relationships expressed in natural language are often vague (vagueness is inherently associated with real-world temporal information), and it is necessary to extend traditional temporal reasoning formalisms to cope with this kind of vagueness [39].

In temporal question answering systems, answering a complex question may require decomposing the original question into partial questions, answering those partial questions, and combining the partial answers into the final answer. Temporal questions are an important class of complex questions, in which the accurate representation of the time span of events is essential to the treatment of such complex questions [38].

However, a lot of time information is ill-defined, subjective, or uncertain, and the boundaries of time periods can often be vague. Thus, the time span representation should be tolerant of imprecision in temporal question answering systems. Zhou et al. [48] summarised the common types of temporal expressions, based on an exhaustive analysis of 147 clinical records, establishing a temporal expression classification from such expressions. Despite including uncertain temporal expressions in the resulting classification, the authors state that the automatic extraction work was hampered by the existence of this expression type. Although TimeML is able to distinguish imprecise temporal expressions, it is restricted to describing imprecision in terms of language structure, clouding later temporal processing. For example, in the sentence “frequent headaches for less than one month”, a patient tries to describe how long a headache has lasted. The corresponding amount of time, however, cannot be accurately defined, due to the modifier “less than”. The target imprecise expression “less than one month” is annotated in TimeML as <TIMEX3 value=“P1M” mod=“LESS_THAN”>. As a consequence, when interpreting this expression and its annotated features, it is not clear whether we should consider each possible number of days between 0 and 30 as equally likely, or whether, for example, 20–25 days ago is more likely than 5–10 days ago or even “yesterday”.

Fuzzy set theory is a representation formalism suitable for this purpose, allowing the definition of a gradual beginning and ending of events [32]. A fuzzy set is the basic concept that underlies fuzzy systems theory [33] and involves capturing, representing, and working with linguistic notions, being employed in circumstances where imprecision, unpredictability, and vagueness are a concern. A fuzzy set S is characterised by a membership function A mapping the elements of a (finite or not) domain, space, or universe of discourse T into the unit interval [0, 1]. That is, \(A(t): T \rightarrow [0,1]\) [47]. A membership function A can be defined in different forms, such as triangular or trapezoidal functions, or continuously differentiable curves with smooth transitions, such as normalised Gaussian functions. The height of a fuzzy set S is the largest membership grade of any element in that set (Eq. 1); a fuzzy set S is called normal when \(height(S) =1\), and subnormal otherwise [33].

$$\begin{aligned} height(S) = max\left\{ A(t), t \in T\right\} \end{aligned}$$
(1)

The support of S, supp(S), is the crisp set with all the elements of T satisfying \(A(t) > 0\). Likewise, the core of S, core(S), is the crisp set with all the elements of T satisfying \(A(t) = 1\), whereas its boundary, bound(S), encompasses all the elements of T with membership grades in the range ]0, 1[, as shown in Fig. 1 [16].
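As a minimal illustration of these concepts (our own sketch, not part of the cited work), the following Python fragment computes the height, support, and core of a fuzzy set given over a discrete domain:

```python
def height(A):
    """Largest membership grade of any element (Eq. 1)."""
    return max(A.values())

def support(A):
    """Crisp set of elements with membership grade > 0."""
    return {t for t, grade in A.items() if grade > 0}

def core(A):
    """Crisp set of elements with membership grade equal to 1."""
    return {t for t, grade in A.items() if grade == 1.0}

# Fuzzy set over a discrete domain of days, given as {element: membership grade}.
S = {1: 0.0, 2: 0.5, 3: 1.0, 4: 1.0, 5: 0.5, 6: 0.0}
print(height(S))   # 1.0, so S is a normal fuzzy set
print(support(S))  # {2, 3, 4, 5}
print(core(S))     # {3, 4}
```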

Fig. 1 Concepts related to a fuzzy set [16]

Although some proposed approaches and systems can identify temporal information in text [5, 15, 28, 41], they do not deal with imprecise temporal expressions, like “a few weeks ago” or “the coming months”, in terms of defining more specific attributes to describe and connect those expressions to a timeline. Such approaches do not implement temporal-related logics to manipulate such inaccurate information, for example, to compare events associated, respectively, with expressions such as “about 2 months ago” and “a few weeks ago”, indicating which one happened before or after [29].

Nagypáal and Motik [32] describe a fuzzy interval-based temporal model capable of representing imprecise temporal knowledge. It generalises Allen’s [2] temporal relations on intervals by first defining crisp interval relations based on set theory and then extending them to the fuzzy case. The presented temporal model is intended for use in ontology modelling, following a modular semantics pattern which tries to keep the semantics of each model separate and to provide clean interfaces between them. By examining the different properties of the fuzzy temporal relations (such as transitivity), one can draw basic inferences even in the case of fuzzy intervals.

Schockaert et al. [39] present a framework to represent, compute, and reason about temporal relationships between events that have imprecise time spans, represented by fuzzy sets (fuzzy time intervals). The proposed model preserves many of the properties of Allen’s relations, and it uses a transitivity table for efficient fuzzy temporal reasoning. The qualitative relations between two fuzzy intervals are defined in terms of the ordering of the gradual beginnings and endings of these intervals (the ordering of the time points belonging to these intervals). Four basic fuzzy relations are defined to order two time points a and b (long before, before or approximately at the same time, approximately at the same time, just before).

Schockaert [38] suggests an approach based on fuzzy sets to define the beginning and ending of events and provides a fully automatic procedure which uses statements on the Web to construct the membership functions. To obtain useful statements from the Web, the authors used the snippets returned by Google for a set of automatically generated queries. In most applications, all membership functions are defined by an expert; this is considered the first attempt to construct membership functions for fuzzy time periods automatically. Figure 2 shows an example that considers the time span of World War 2: there is no unique point in time that corresponds to the beginning or ending of this war.

Fig. 2 Fuzzy set representing the time span of World War 2 [38]

A similar approach was used in Blamey et al. [10] to represent a temporal expression S by a function f(t), a probability density function for the continuous random variable \(T_s\), using photographs uploaded to the photograph-sharing site Flickr. After collecting a list of timestamps for a specific temporal term, the goal is to find a probability density function that provides a convenient representation and smooths the data appropriately. The authors argue that temporal expressions can communicate more than points and intervals, and that their cultural meaning is much more complex—often difficult to define precisely. Thus, a distributed definition can capture such cultural meaning in a more detailed way, as shown in Fig. 3 for the expression “Christmas”.

Fig. 3 Distribution of “Christmas” images on Flickr [10]

Even though the related work described above uses fuzzy sets to represent individual temporal expressions and temporal relations, it relies on external sources of data to describe specific historical events or generic periods of time (e.g. holidays). The proposed approaches are focused on specific expressions or periods of time, and they do not attempt to create a generic normalisation model that describes the imprecision in temporal data across the different kinds of imprecise temporal expressions. Our work goes further; it does not tackle exactly the same problem as the related work and is therefore not directly comparable to it.

In this work, we assume that query times are grounded and known. However, this is in itself a significant task, covered in the literature [25]. Knowledge base population has included a simplified version of the temporal bounding task, with maximum and minimum bounds for start and end times, and a corresponding evaluation scheme [3, 24].

3 Imprecise temporal data in text

Fig. 4 Example of an event (\(e_2\)) placed in an imprecise point in time

Considerable effort has been put into the extraction of temporal information from natural language texts, allowing systems to deal with complex temporal questions. However, the temporal intervals expressed in natural language are often vague, making it necessary to extend traditional temporal reasoning formalisms to cope with the vagueness [39]. Imprecise timexes make it hard to evaluate whether events should be included in a query result that involves timeline evaluation.

Figure 4 illustrates the importance of dealing with imprecise points in time. A query system performing searches over extracted events should be able to find those bounded by a certain period of time. Given two events \(e_1\) and \(e_2\), each one is associated with a temporal expression, \(t_1\) and \(t_2\) respectively, where \(t_1\) is a precise DATE that makes it possible to place \(e_1\) at a specific point on a timeline, and \(t_2\) is an imprecise reference of the form “approximately N days later” which makes it impossible to know the exact day when event \(e_2\) occurred. However, it can be reasoned that \(e_2\) occurred after \(e_1\). Considering a query that performs a search within the period bounded by \(q_b\) and \(q_e\), where \(q_b< t_1 < q_e\) and \(q_e < t_1 + N\), we can affirm with certainty that \(e_1\) would be part of the search result. On the other hand, it is not possible to evaluate whether \(e_2\) is part of the same query result, as the numerical reference that surrounds the placement of \(e_2\) within the timeline carries a degree of vagueness that makes it impossible to say the exact date when \(e_2\) happened.

In this section, we show the motivation of this work by quantifying the number of imprecise temporal expressions found in different corpora. We also propose a classification for imprecise timexes.

3.1 Quantifying imprecise timexes

Table 2 English (En) and Portuguese (Pt) corpora analysed for the occurrence of precise and imprecise temporal expressions
Table 3 Occurrence of imprecise timexes in (a) non-clinical and (b) clinical corpora

In order to understand the relevance of normalising imprecise temporal information in different domains, we analysed a set of three clinical and six non-clinical corpora in English and Portuguese (Table 2) to compare the occurrence of imprecise timexes in both general and specific domain data. We used the HINX system [44] to identify the occurrence of imprecise timexes. HINX asserts a specific annotation feature (\(precision=``imprecise''\)) to identify imprecise timexes, based on a set of rules to identify words, expressions, and specific language structures that represent imprecision.

Table 3 compares the number of imprecise temporal expressions against the total number of timexes in each corpus and shows that imprecise timexes in clinical corpora can reach almost 35% (SLAM corpus, 34.8%) of the temporal expressions. The percentage of imprecise expressions found in newswire was no more than 13% (WikiWars corpus).

Table 4 Occurrence of imprecise timexes by temporal granularity

Table 4 describes the distribution of imprecise timexes in terms of temporal granularity. The temporal granularity is the time granularity used to compose the timex, as DAY in “in less than 15 days”, or UNDEFINED in “more recently”. The set of expressions with granularity YEAR, MONTH, WEEK, and DAY represents more than 60% of the total amount of imprecise expressions in both clinical and non-clinical corpora. Imprecise expressions denoting time (HOUR, MINUTE, and SECOND) represent less than 5% of imprecise expressions in non-clinical data and less than 3% in clinical corpora.

Table 5 Occurrence of imprecise timexes by class in clinical corpora

Finally, Table 5 shows the distribution of imprecise temporal expressions found in clinical corpora according to each of the main temporal classes defined by TimeML (DATE, TIME, DURATION, and SET). The occurrence of imprecise timexes is concentrated on the classes DATE and DURATION for clinical documents. In a similar analysis, we observed that the occurrence of imprecise timexes is concentrated on the class DURATION in non-clinical documents.

3.2 Classification of imprecise timexes

We analysed the full set of imprecise expressions found in clinical corpora in order to understand the different ways the imprecision can be expressed in natural language. We defined six main groups of imprecise timexes according to their main language elements:

1. Present Reference (PR): a time reference related to the present, based on the document creation time (DCT) (e.g. “now”, “recently”, “currently”);

2. Modified Value (MV): an imprecise timex comprising a modified precise amount of time (e.g. “approximately 10 days”, “less than a month”);

3. Imprecise Value (IV): an expression built around an imprecise amount of time (e.g. “some days”, “several weeks”), or formed with an undetermined amount of time, in which the granularity usually appears in the plural and without a numeric value (e.g. “years”);

4. Range of Values (RV): an amount of time defined by boundaries (e.g. “every 3–4 months”, “between 8 and 10 years”);

5. Partial Period (PP): a portion of time within a larger time frame (e.g. “the end of last year”, “middle of January”);

6. Generic Expression (GE): an expression denoting a generic period or amount of time (e.g. “this time”, “at the same time”).

Table 6 Timexes by imprecise type in clinical corpora

Table 6 details the number of imprecise timexes found in each clinical corpus according to the imprecise group. A similar distribution was also observed in non-clinical corpora. We chose to apply and test our proposed methodology starting with the three most representative kinds of imprecise expressions in terms of occurrence (PR, MV, and IV). The PR imprecise type represents more than 50% of imprecise timexes in the clinical domain; however, it comprises expressions devoid of a temporal granularity, requiring a distinct questionnaire design and input data representation.

4 Normalisation of imprecise timexes

Normalisation of an imprecise temporal expression depends on how people reason about imprecise information. Reasoning about an imprecise timex in a specific context, such as in clinical text, may depend on a broader narrative analysis and an understanding of the context in which the expression was created. Despite this possible influence of different contexts on the interpretation of imprecise timexes, we present a methodology for producing normalisation models for each imprecision type according to people’s common cognitive perception of temporal imprecision. Therefore, we collected and pre-processed data on how people interpret vague descriptions of time in text, and we compared different approaches in order to create and select the most appropriate normalisation model.

4.1 Specification of the input data

In order to collect data on how people interpret vague descriptions of time in text, we designed questionnaires in two different languages (Portuguese and English). Designing the questionnaires was necessary since no data set or standard is available for analysing imprecise timexes. Each question aims to capture the perception of an imprecise value for a given imprecise timex, showing a sentence comprising two to three descriptions of time that could be precise or imprecise; the target imprecise timex to be evaluated is underlined. The Portuguese questionnaire comprises 125 questions split into five questionnaires (25 questions each), each question built from sentences found in a set of medical records from the InfoSaude corpus, modified in order to guarantee de-identification. The English version has a total of 150 questions split into ten questionnaires (15 questions each), each question designed using fictional text to capture the perception of a specific imprecise value for a given set of (non-clinical) imprecise timexes.

Inter-annotator agreement (IAA) is usually used to measure the quality of a data set, by seeing how closely people agree on some objective task that is assumed to have a definitive answer, e.g. extraction of some phenomenon from text. In such a case, we would expect annotators to converge on a common value, assuming the data quality is high. Although we are asking people to fill in a questionnaire with a subjective opinion (i.e. not asking them to extract an objective fact from the text), we used Fleiss’ kappa [21] as a statistical measure for assessing the reliability of agreement when a fixed number of raters assign categorical ratings to a number of items. The types of questions covered by each questionnaire, average number of answers, and the inter-annotator agreement are detailed in Table 7.

Table 7 Types of questions in each questionnaire and inter-annotator agreement
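For reference, Fleiss' kappa can be computed with off-the-shelf tools once the answers are arranged as an items-by-raters matrix of category codes. The sketch below is illustrative only; the answer matrix is invented, and the actual computation behind Table 7 may differ, for example in its handling of missing answers.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical answers: rows are questionnaire items, columns are raters,
# values are the category chosen by each rater (e.g. 0='days', 1='weeks', ...).
answers = np.array([
    [1, 1, 2, 1, 1],
    [0, 1, 1, 1, 0],
    [2, 2, 2, 3, 2],
    [1, 1, 1, 1, 1],
])

# Convert to an items-by-categories count table and compute Fleiss' kappa.
table, _ = aggregate_raters(answers)
print(fleiss_kappa(table))
```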

MV and IV questions in the Portuguese survey asked for a specific number of days, weeks, months, or years (e.g. for “more than 10 days”, one specific number of days should be selected, with options ranging from 7 to 60 days). The same type of question in English asked for a possible range of time (e.g. for “more than 5 days”, a start–end range of days should be selected, with the start point ranging from 0 to 40 days and the end point ranging from 0 to 60 days). An additional option “more than 60 days” was also included in the questions covering the MV imprecise type. PR questions (“now”, “currently”, “recently”) asked for a temporal granularity that would better describe when the associated event starts. We wanted to test different ways of answering each question, leading to the mentioned differences in the design of each questionnaire in terms of how the answers should be entered. Figure 5 shows examples of questions extracted from the questionnaires in English.

Fig. 5 Example of questions used to design the questionnaire in English

As most of the imprecise temporal expressions found in the documents we had previously analysed refer to the classes DATE and DURATION, we considered “1 day” as the basic and minimal unit of time in the experiments. We used a discrete set of integer numbers of days, disregarding granularities having TimeML TIMEX3 type TIME (hours, minutes, and seconds).

The Portuguese survey was approved by the InfoSaude Research Committee and submitted to 50 universities in Brazil, covering students and staff members from different departments, from which we gathered a total of 352 submissions—each question had on average 70 responses. The English survey was approved by the University of Sheffield’s Research Ethics Committee and submitted to all students and staff members on an opt-out mailing list at that institution. We gathered a total of 890 submissions in English—each question had on average 90 responses.

4.2 Membership functions

We aim to normalise imprecise expressions through the use of fuzzy membership functions (MSF). The MSF would place an imprecise timex in the timeline with a certain confidence level. In addition, a search result would have additional information indicating the confidence score for each event associated with an imprecise timex. Given a list of MSFs for the same kind of imprecise expression (e.g. of the form “less than N days”), we want to produce a generic model where, given N as an input, the model can calculate the parameters to describe a MSF for all expressions of that type.

We used two types of MSFs in our experiments: trapezoidal (four-point-based) and hexagonal (six-point-based) membership functions. These were chosen because: (a) they are asymmetrical and can have their shapes adapted flexibly to match different patterns, and (b) their linear boundaries make them easier to use in terms of computing fuzzy logical and relational operations.

A trapezoidal MSF is defined by a set of four parameters (p, r, s, v), such that \(M_4(x) : \mathbb {I} \rightarrow [0; 1]\) and \(p< r \le s < v\). The parameters p and v are the boundary limits where the confidence is 0, while r and s are the boundary limits where the confidence is 1. When \(r = s\), the MSF takes the shape of a triangular function. The MSF parameters (p, r, s, v) are equivalent to the values (a, b, c, d) in Fig. 1.

Similarly, a hexagonal (six-point-based) MSF is defined by a set of six parameters (p, q, r, s, t, v), such that \(M_6(x) : \mathbb {I} \rightarrow [0; 1]\) and \(p< q< r \le s< t < v\); in addition to the trapezoidal boundaries, q and t are the values where the confidence is 0.5. In this work, we refer to trapezoidal and hexagonal MSFs by their definition parameters, using the notation \(M_4(x,[p,r,s,v])\) and \(M_6(x,[p,q,r,s,t,v])\).
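The piecewise-linear definitions above translate directly into code. The following sketch is an illustration of the definitions, not code from the original work; the function names m4 and m6 are ours.

```python
def m4(x, p, r, s, v):
    """Trapezoidal membership function M4(x, [p, r, s, v]) with p < r <= s < v."""
    if x <= p or x >= v:
        return 0.0
    if r <= x <= s:
        return 1.0
    if x < r:                        # rising edge between p and r
        return (x - p) / (r - p)
    return (v - x) / (v - s)         # falling edge between s and v

def m6(x, p, q, r, s, t, v):
    """Hexagonal membership function M6(x, [p, q, r, s, t, v]);
    q and t are the points where the membership grade is 0.5."""
    if x <= p or x >= v:
        return 0.0
    if r <= x <= s:
        return 1.0
    if x < r:                        # rising side: 0 at p, 0.5 at q, 1 at r
        return 0.5 * (x - p) / (q - p) if x < q else 0.5 + 0.5 * (x - q) / (r - q)
    # falling side: 1 at s, 0.5 at t, 0 at v
    return 0.5 + 0.5 * (t - x) / (t - s) if x < t else 0.5 * (v - x) / (v - t)

# "less than 30 days" from the Portuguese survey: LessThan_P30D(x_days, [16, 19, 21, 31])
print(m4(25, 16, 19, 21, 31))        # 0.6
```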

For each question within the questionnaires, we attempted to best approximate the corresponding \(M_4\) and \(M_6\) membership functions with respect to their definition parameters. For each question, we calculated a histogram based on the number of answers given to each possible option. Each histogram was then approximated by a trapezoidal and by a hexagonal membership function, using a full search method in order to minimise the approximation error. We looked for the best combination of values for the parameters (p, r, s, v) or (p, q, r, s, t, v), and the best MSF height on the y axis, which corresponds to the number of given answers. Figure 6a shows the histogram and trapezoidal function obtained for the expression “less than 30 days” from the survey in Portuguese, defined as \(LessThan_{P30D}(x_{days},[16,19,21,31])\) – the parameters (p, r, s, v) represent numbers of days, and the confidence is 1 at height 8 in the histogram. Similarly, Fig. 6b presents the histogram and approximated trapezoidal function for the expression “about 3 months” from the questionnaire in English, defined as \(Approx_{P3M}(x_{days},[71,87,92,110])\) – the confidence is 1 at height 32 in the histogram.
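A minimal sketch of this full-search approximation follows, reusing the m4 function from the previous sketch. It is our own reading of the procedure: candidate parameters are drawn from the answer grid, candidate heights from the answer counts, and the combination with the smallest squared error against the histogram is kept; the exact search space and error measure used in the paper may differ.

```python
from itertools import combinations

def fit_trapezoid(histogram):
    """Approximate a histogram {option: count} by a scaled trapezoidal MSF.

    Returns the parameters (p, r, s, v) and the height h minimising the squared
    error between h * m4(x, p, r, s, v) and the answer counts."""
    xs = sorted(histogram)
    best, best_err = None, float("inf")
    for p, r, s, v in combinations(xs, 4):        # full search over the answer grid
        for h in range(1, max(histogram.values()) + 1):
            err = sum((histogram[x] - h * m4(x, p, r, s, v)) ** 2 for x in xs)
            if err < best_err:
                best, best_err = ((p, r, s, v), h), err
    return best

# Hypothetical histogram: number of days chosen -> number of respondents.
hist = {10: 1, 15: 2, 18: 5, 20: 8, 22: 8, 25: 6, 30: 3, 40: 1}
print(fit_trapezoid(hist))
```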

Fig. 6 Histogram and trapezoidal MSF for two imprecise timexes

Fig. 7 Unsupervised baseline parameters for IV and MV expressions

4.3 Normalisation models

Table 8 MLP design

We compared different approaches, such as linear regression and the multilayer perceptron [9], to model each kind of imprecision. In order to identify which method best models each group of imprecise timexes, we explored a diverse set of alternatives. The following steps were performed to analyse the data collected from the questionnaires described in Sect. 4.1:

1. We started by splitting the total set of answers into two data sets (50%:50%) to be used as training and test data sets. The input data collected from the questionnaires was pre-processed: for every question, we calculated the distribution of answers in the form of a histogram, and a trapezoidal and a hexagonal membership function were approximated to describe the given histogram, as described in the previous subsection.

2. For questions using a temporal granularity other than “DAY”, we tried both options when training the models: (a) the numeric value (Val) extracted from the temporal expression with its original granularity (e.g. “3” in “about 3 months”), and (b) the same expression converted to the granularity of days (Day) (e.g. “3” in “about 3 months” was converted to “90 days”).

3. For each expression type, we defined range-based unsupervised parameters to use as a baseline; these were arbitrary and manually chosen. Figure 7 shows the unsupervised interval parameters defined for MV and IV questions. Each range [b, e] was mapped to a \(MSF(x,[b-1,b,e,e+1])\) throughout the experiments. For the modifier MANY, for example, the range value [6, 8] is equivalent to MSF(x, [5, 6, 8, 9]).

4. In order to produce a generic model that could be used to calculate any membership function for a given imprecise timex type, we applied four different variations of linear regression to generalise each one of the parameters used to define the trapezoidal (p, r, s, v) and hexagonal (p, q, r, s, t, v) membership functions for each given type of imprecise timex (a sketch of this fitting step appears after this list): (a) the usual (\(y = a + b * x\)) linear regression (Lin-A); (b) the linear regression with the independent constant a forced to be equal to zero (Lin-0); (c) the linear regression on the natural logarithm values of each expression (\(ln(y) = a + b * ln(x)\)), in an attempt to map expressions given in terms of years (e.g. “5 years” = “1825 days”) close to those describing periods of days or weeks (Log-A); and lastly, (d) the linear regression based on the logarithm values extended to force \(a = 0\) (Log-0).

5. For those timexes comprising imprecise values (IV), we also calculated the mean (MEAN) values of each membership function parameter, combining the normalised values described in step 2 (Val and Day) and step 4 (Lin and Log).

6. For those timexes comprising imprecise values (IV) and present references (PR), we used the temporal context as an input value. We considered the “Temporal Context” to be the distance in days between the current date (DCT—document creation time) and the last timex mentioned in the sentence prior to the imprecise timex being evaluated. For the designed questionnaires, the DCT was defined as the date when each questionnaire was published. This approach was used in an attempt to evaluate whether the perception of a present reference imprecise timex is influenced by the temporal context distance.

7. For the MV and IV types of imprecise expressions, we used a multilayer perceptron (MLP) with the backpropagation algorithm [22] to learn how to return the membership function parameters for a given imprecise timex. We again combined the normalised values described in step 2 (Val and Day) and step 4 (Lin and Log). We used k-fold cross-validation with \(k=4\) to select the best model. The internal MLP structure and learning parameters were chosen in a previous tuning step, after testing and comparing different configuration settings; Table 8 describes the features and parameters used in the training step. In order to test the hypothesis that the understanding of Present Reference (PR) expressions could be influenced by the temporal context, we only tested the linear regression approach for that kind of expression.

8. In order to evaluate each model, we compared each individual membership function generated by the given model with the equivalent membership function from the test data set. We used the areas of each membership function to produce the F1-score (Eq. 2), which defines how much the areas of the two functions overlap (a sketch of this computation appears after Fig. 8 below). Partial areas that do not overlap are considered false positive and false negative areas, and the overlapping area is considered a true positive area. When \(F1 = 1\), both membership functions are exactly the same, and when \(F1 = 0\), there is no overlap between the given functions. The F1-score for the entire model was calculated as the average F1-score over all the membership functions used to test the model.

    $$\begin{aligned} F1(A,B) = \frac{2 \times CommonArea(A,B)}{Area(A) + Area(B)} \end{aligned}$$
    (2)

    Figure 8 shows two hexagonal membership functions—A(x,[1,3,10,13,14,17]) and B(x,[2,3,5,7,9,14])—and the visual representation of the F1-score between A and B, i.e. the percentage of the common area relative to the total area of both functions. In the illustrated example, the resulting F1-score is 0.6567.

9. Finally, for each type of imprecise timex, we used the average F1-score obtained over all the different expression variations and over the trapezoidal and hexagonal membership functions in order to compare the models and select the most appropriate normalisation model.
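The sketch below illustrates the regression step from item 4 for one MSF parameter. It is a simplified illustration under our own assumptions (ordinary least squares fitted per parameter; the function name fit_linear and the sample data are hypothetical); expressions with other granularities would first be converted to days for the Day variant.

```python
import numpy as np

def fit_linear(n_values, param_values, through_origin=False, log_space=False):
    """Fit one MSF definition parameter (p, r, s or v) as a function of the numeric
    value n extracted from the imprecise expression.

    through_origin=True forces the intercept a to 0 (Lin-0 / Log-0);
    log_space=True fits ln(y) = a + b * ln(x) (Log-A / Log-0)."""
    x = np.log(n_values) if log_space else np.asarray(n_values, dtype=float)
    y = np.log(param_values) if log_space else np.asarray(param_values, dtype=float)
    X = x.reshape(-1, 1) if through_origin else np.column_stack([np.ones_like(x), x])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs  # [b] when through_origin, otherwise [a, b]

# Hypothetical training pairs for "less than N days": N and the fitted v parameter.
n = [10, 14, 30, 60]
v = [12, 16, 32, 65]
print(fit_linear(n, v))                        # Lin-A: intercept a and slope b
print(fit_linear(n, v, through_origin=True))   # Lin-0: slope b only
```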

Fig. 8 F1-score representation between membership functions A and B—partial areas that do not overlap are considered false positive and false negative areas, and the overlap is considered a true positive area (see Eq. 2)
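Equation 2 can be approximated numerically by sampling both functions on a fine grid: the common area is the integral of the pointwise minimum. The sketch below is our own illustration, reusing m6 from the earlier sketch; it reproduces the value reported for Fig. 8.

```python
import numpy as np

def f1_area(msf_a, msf_b, lo, hi, steps=10_000):
    """F1-score between two membership functions (Eq. 2), approximated numerically.

    The common (overlapping) area acts as the true positives; the non-overlapping
    parts of each function act as false positives and false negatives."""
    xs = np.linspace(lo, hi, steps)
    a = np.array([msf_a(x) for x in xs])
    b = np.array([msf_b(x) for x in xs])
    common = np.trapz(np.minimum(a, b), xs)
    return 2 * common / (np.trapz(a, xs) + np.trapz(b, xs))

# The two hexagonal functions of Fig. 8.
A = lambda x: m6(x, 1, 3, 10, 13, 14, 17)
B = lambda x: m6(x, 2, 3, 5, 7, 9, 14)
print(f1_area(A, B, 0, 20))   # ~0.6567, as reported for Fig. 8
```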

The linear regression model is motivated by the hypothesis that some kinds of imprecise temporal expressions (e.g. “less than x days ago” or “in approximately x weeks”) could be linearly dependent on the input amount of time x. Given this hypothesis, the simplest and least data-hungry tools to apply are linear regression and the MLP. While an SVM offers higher expressivity, it also risks making mistakes with smaller amounts of data, and if good results can be obtained through linear regression or the MLP, that result is strong on its own. We therefore contrasted the linear regression results with a nonlinear approach, adopting the MLP as a nonlinear alternative to train normalisation models for each kind of imprecise timex, and we provide the corresponding comparisons.

In order to graphically represent the normalisation models, we developed a chart format where we plot both the test data and the produced generalisation model. Figure 9 shows how this graphical representation works. Each known membership function produced from the input data (e.g. the subfigure in the top left represents the expression “less than 30 days”) is plotted as a vertical bar, with a dark central area representing the top of the MSF, where the confidence is 1—the bottom and the top of each vertical bar represent the MSF limits where the confidence is 0. The grey area in the chart’s background is the normalisation model obtained for the expression type “LessThan”. Thus, when we need to normalise an unknown expression, the normalisation model gives us the parameters that describe the corresponding MSF definition for the given expression type, by taking the limits of each dark and light grey area. For example, the highlighted red area on the right represents the limits for an unknown expression “less than 90 days”, which would be defined as the trapezoidal MSF \(LessThan_{P90D}(x_{days},[23,65,85,96])\). Other examples of known MSFs represented in the same figure as vertical bars include “less than 10 days”, “less than 2 weeks”, and “less than 2 months”—the figure shows ten MSFs corresponding to the test data set for the given type of imprecise expression.

Fig. 9 Graphical representation of a normalisation model

5 Evaluation

In this section, we present the results of the analysis for the evaluated imprecise types (MV, IV, and PR), based on the representation model described in the previous section. We performed a statistical hypothesis t-test to verify the significance of the F1-scores reported for each approach. The significance threshold was set at 0.05.
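As an illustration of this significance test (the exact pairing of samples is not spelled out in the text, so the paired design below is an assumption on our part), a t-test over the per-expression F1-scores of two models can be computed as follows; the score lists are invented.

```python
from scipy import stats

# Hypothetical per-expression F1-scores of two normalisation models on the same test set.
f1_regression = [0.81, 0.76, 0.88, 0.69, 0.74, 0.83]
f1_mlp        = [0.85, 0.79, 0.86, 0.75, 0.78, 0.88]

# Paired t-test; the difference is considered significant when the p value is below 0.05.
t_stat, p_value = stats.ttest_rel(f1_regression, f1_mlp)
print(t_stat, p_value)
```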

5.1 Modified value (MV) expressions

Table 9 compares the results of each model used to produce trapezoidal (\(M_4\)) and hexagonal (\(M_6\)) membership functions for the group of expressions comprising “less than”, “more than”, and “approximately” subtypes for both languages (Portuguese and English). Different models are compared using the average (Avg) score between \(M_4\) and \(M_6\). We highlight in boldface the best Avg score for each approach (regression and MLP) in each language (English and Portuguese).

Table 9 F1-scores for MV temporal expressions in Portuguese and English

The Log-A variation achieved the best score for this kind of expression among all the linear regression variations for both languages. The MLP approach produced a result that is better than the Log-A regression variation in Portuguese. However, MLP achieved a result that is similar to the baseline in English.

For both languages, the t-test shows significant differences when comparing the best MLP against the best regression F1-scores, considering the significance threshold of 0.05: (a) in English, p value=0.000481 when comparing the results of the regression-Log(A) and MLP-Val/Log approaches; (b) in Portuguese, p value=0.003243 when comparing the results of the regression-Log(A) and MLP-Val/Lin approaches. In addition, we also compared the results of regression-Lin(0) and regression-Log(A), for which we found no significant differences in either language (p value=0.183702 for English; p value=0.314776 for Portuguese). The Lin(0) variation does not rely on logarithmic transformations, and this model can be calculated directly by applying simple linear transformations to the input imprecise expression.

Fig. 10 Generalisation of “less than X days” expressions within the period of 0–90 days for two different approaches in English

Figure 10a shows the model using the Log-A linear regression variation, and Fig. 10b shows the model from MLP-Val/Log, both used to produce trapezoidal functions for expressions of the form “less than N days” in English. The MLP model is consistent when producing membership function parameters that are inside the limit boundaries used to train the given model. However, it is not consistent when trying to produce membership function parameters that are outside those limits. For instance, it finds values for the parameters r and s that are greater than N for “less than N days” for each \(N > 60\) (darker grey area in the chart). Similar differences between linear regression and MLP approaches were observed in the Portuguese models. Linear regression models are more consistent when generalising MV imprecise timexes.

Although the MLP approach performed better for one of the languages, its inconsistency when dealing with imprecise expressions outside the limit boundaries used to train the model imposes limitations and restrictions on its use. The Lin-0 and Log-A models are more efficient and stable at generalising this kind of temporal imprecision, and their statistical similarity, together with the simplicity and straightforward applicability of the Lin-0 model, leads us to strongly recommend Lin-0 for modelling MV imprecise expressions. In Table 10, we present the factors \([B_p,B_r,B_s,B_v]\) used to calculate the parameters [p, r, s, v] that define trapezoidal MSFs for MV temporal expressions in both languages for a given amount of time in a temporal granularity (\(N_{tgran}\)). For example, the expression “less than 30 days” in English is defined as:

$$\begin{aligned} \begin{aligned} LessThan_{n_{days}}&= MSF(x_{days},[n*0.3693,n*0.7964,n*0.9371,n*1.0803]) \\ LessThan_{30days}&= MSF(x_{days},[30*0.3693,30*0.7964,30*0.9371,30*1.0803]) \\&= MSF(x_{days},[11,23,28,32]) \end{aligned} \end{aligned}$$
(3)
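In code, applying the Lin-0 model reduces to scaling the factors by the amount of time n; truncating to whole days reproduces the values of Eq. 3. This is our own sketch, using only the English “less than” factors shown above.

```python
def lin0_msf(n_days, factors):
    """Trapezoidal MSF parameters [p, r, s, v] for an amount of time n (in days),
    given Lin-0 factors [B_p, B_r, B_s, B_v]; values are truncated to whole days."""
    return [int(n_days * b) for b in factors]

# English Lin-0 factors for "less than N days", as used in Eq. 3.
LESS_THAN_EN = [0.3693, 0.7964, 0.9371, 1.0803]
print(lin0_msf(30, LESS_THAN_EN))   # [11, 23, 28, 32], matching Eq. 3
```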
Table 10 Linear regression factors \([B_p,B_r,B_s,B_v]\) used to produce the parameters [p, r, s, v] that define Lin-0 trapezoidal MSFs for MV expressions in Portuguese (Pt) and English (En)

5.2 Imprecise value (IV) expressions

Table 11 compares the results of each model used to produce trapezoidal and hexagonal membership functions for the IV type of temporal expressions. Linear regression and MLP methods used the distance in days (Temporal Context) to the last precise temporal expression found in the text prior to the target imprecise timex as an input parameter when creating each model. We used two MLP approaches: (a) one to learn each temporal granularity (“days”, “weeks”, “months”, “years”) and (b) one to learn each imprecise value (“few”, “some”, “many”, “several”). We highlight in boldface the best Avg score for each approach in each language.

Table 11 F1-scores for IV temporal expressions in Portuguese and English

The best average F1-scores for each evaluated method are similar within each language (ranging from 0.76 to 0.79 in Portuguese, and from 0.84 to 0.88 in English). The best average F1-score was achieved by the MLP model trained on granularities in Portuguese and by the linear regression (Val/Lin) in English. However, these models do not show any significant difference from the corresponding best model using the mean method, considering the significance threshold of 0.05: (a) in English, p value=0.075434 when comparing the mean Day/Lin and the regression (Val/Lin) approaches; (b) in Portuguese, p value=0.199188 when comparing the mean Val/Lin and the MLP-Val/Lin (granularity) approaches. The main advantage of the mean approach is that it can be applied independently of an input value or temporal context.

Fig. 11 Hexagonal membership functions for IV imprecise timexes comprising modifiers “few”, “some”, “many”, and “several” combined with distinct temporal cardinalities (“days”, “weeks”, “months”, and “years”)—y-axes are the result (\(\mu \)) of each MSF, and x-axes represent the amount of time in the same temporal granularity corresponding to the label of each chart

Figure 11 shows the hexagonal functions created by the method Mean(Day/Lin) for the IV timexes in Portuguese and English. Table 12 presents the parameters [p, r, s, v] used to define trapezoidal MSFs for IV temporal expressions in both languages. The set of parameters [p, r, s, v] is given by the granularity value (Val/Lin) and by the absolute number of days (Day/Lin). For example, the expression “few weeks” in English is defined as \(FewWeeks(x_{weeks},[1,3,3,9])\) (by the Val/Lin approach) or as \(FewWeeks(x_{days},[9,20,25,59])\) (by the Day/Lin approach). Note that the approaches Val/Lin and Day/Lin produce MSFs with different temporal granularities, respectively identified by \(x_{weeks}\) and \(x_{days}\) in each MSF definition. The former describes imprecision in the same temporal granularity as the original expression; the latter always expresses the probabilistic distribution of an imprecise temporal expression in numbers of days.

5.3 Present reference (PR) expressions

Present reference (PR) imprecise timexes comprise those expressions including “currently”, “recently”, and “now”. For this kind of imprecise timex, we asked people to choose the most appropriate option to express the amount of time since the event associated with the target expression occurred. Figure 12 shows two examples of questions extracted from the English questionnaire. In each question, the target imprecise expression was to be defined by another imprecise timex; the options included four IV expressions: “days”, “weeks”, “months”, and “years”.

We calculated the histogram of the given answers for each PR question, and we used the percentage of answers given to each IV expression option to create a combined membership function using a percentage of the parameters extracted from each IV expression. To calculate the linear regression model, we used the percentage of answers given for each PR question in order to produce a generic model based on the temporal context (in days). Figure 13 shows the models for two different periods (50 weeks and 20 years), including the resulting membership function representation for each PR question in English and Portuguese. Table 13 shows the weights used to combine IV expressions with different temporal granularities in order to produce a membership function that describes each PR expression.

Table 12 Parameters [p, r, s, v] produced by the mean approach used to define trapezoidal MSFs for IV temporal expressions in Portuguese and English
Fig. 12 Example of questions covering PR imprecise timexes in English

Fig. 13 Hexagonal membership function model for PR imprecise timexes

For example, in question number 7 (Fig. 12), the expression “recently” can be mapped to an MSF by combining the IV parameters from the mean approach:

$$\begin{aligned}&M_{recently} = 0.180 \times M_{days} \ + 0.528 \times M_{weeks} \ + 0.281 \times M_{months} \ + 0.010 \times M_{years} \end{aligned}$$

which is equivalent to:

$$\begin{aligned} M_{recently}&= M(x, [ 0.180 \times 1 + 0.528 \times 7 + 0.281 \times 26 + 0.010 \times 239, \\&\qquad 0.180 \times 3 + 0.528 \times 18 + 0.281 \times 92 + 0.010 \times 1125,\\&\qquad 0.180 \times 5 + 0.528 \times 30 + 0.281 \times 134 + 0.010 \times 1498,\\&\qquad 0.180 \times 14 + 0.528 \times 73 + 0.281 \times 296 + 0.010 \times 5550 ] ) \end{aligned}$$

or:

$$\begin{aligned} M_{recently} = M(x_{days},[13,47,70,180]) \end{aligned}$$
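This combination is simply a weighted sum of the IV parameter vectors, with the answer proportions as weights. Below is a sketch using the numbers of the worked example above; small deviations from [13, 47, 70, 180] can appear because the displayed weights are rounded.

```python
def combine_pr(weights, iv_msfs):
    """Combine IV membership functions into a PR membership function as a weighted
    sum of their definition parameters (the weights are answer proportions)."""
    n_params = len(next(iter(iv_msfs.values())))
    return [round(sum(w * iv_msfs[g][i] for g, w in weights.items()))
            for i in range(n_params)]

# Answer proportions for "recently" (question 7) and the IV trapezoids in days,
# both taken from the worked example above.
weights = {"days": 0.180, "weeks": 0.528, "months": 0.281, "years": 0.010}
iv = {"days": [1, 3, 5, 14], "weeks": [7, 18, 30, 73],
      "months": [26, 92, 134, 296], "years": [239, 1125, 1498, 5550]}
print(combine_pr(weights, iv))   # close to [13, 47, 70, 180]
```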

PR expressions in English are more linearly dependent on the temporal context than the same expressions in Portuguese. That means that, in English, “recently” represents a larger amount of time when used in a temporal context of “10 years” than when used in a temporal context of “6 months”. On the other hand, the equivalent expression in Portuguese seems to be understood similarly independently of the temporal context in which it is used.

The linear dependency in English and the nonlinear dependency in Portuguese are confirmed by the statistical t-test at the significance threshold of 0.05. We compared the PR models produced from IV imprecise expressions by the mean and linear regression approaches: (a) the mean approach is a method that does not depend on the temporal context and uses the mean values obtained from IV expressions to compose PR expressions based on the average of distinct IV modifiers; (b) the linear regression approach uses the temporal context as an input parameter to produce MSFs. In Portuguese, the mean and linear regression approaches do not show significant differences when comparing their final scores (p value=0.438141), while the same models in English are significantly different (p value=0.015191).

We found the set of PR imprecise temporal expressions much more challenging to model in terms of fuzzy representation. We believe further experiments focused on this specific type of imprecise temporal reference are required in order to better understand the interpretability of each possible PR expression in different contexts.

Table 13 Weights used to combine temporal granularities from IV membership functions to produce the parameters that define trapezoidal MSFs for PR expressions in Portuguese (Pt) and English (En)

5.4 Comparing languages

We compared the models created for imprecise temporal expressions in English and Portuguese. For each expression format (e.g. “some days” in Portuguese and the same expression in English), we calculated a partial F1-score between the two languages as the average of the F1-scores for the trapezoidal and hexagonal MSFs. The average of these partial F1-scores over all expressions gives the overall similarity between Portuguese and English.

However, when calculating the F1-score using the MSF area, it is not possible to identify whether the differences are concentrated more in the top \((\mathrm{confidence}=1)\) or the bottom \((\mathrm{confidence}=0)\) of the functions. In order to identify how relevant such differences are, we used a variation of the F1-score that we call \(\hbox {\textit{F}1}_{3\mathrm{D}}\). We consider each MSF as a tridimensional object, in which the third dimension identifies how deep each MSF is, varying from 0 at the bottom to 1 at the top. Instead of using the MSF areas, we then use the MSF volumes to calculate \(\hbox {\textit{F}1}_{3\mathrm{D}}\) (Eq. 4). Figure 14 illustrates the difference between F1 and \(\hbox {\textit{F}1}_{3\mathrm{D}}\), comparing three MSFs (A, B, and C). A and B differ at the top, while A and C have exactly the same difference in terms of area, but at the bottom. Thus, \(F1(A,B) = F1(A,C) = 0.9655\). When calculating \(\hbox {\textit{F}1}_{3\mathrm{D}}\), we observe \(F1_{3\mathrm{D}}(A,B) < F1_{3\mathrm{D}}(A,C)\), which means that the differences between A and B are concentrated closer to the top than the differences between A and C—differences at the top have more influence in decreasing \(\hbox {\textit{F}1}_{3\mathrm{D}}\) than differences at the bottom, due to the MSF depth.

$$\begin{aligned} F1_{3D}(A,B) = \frac{2 \times CommonVolume(A,B)}{Volume(A) + Volume(B)} \end{aligned}$$
(4)
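The text does not spell out how the MSF volume is computed; one reading consistent with the description (the depth at confidence level y is y itself, so differences near the top weigh more) gives \(Volume(A) = \int A(x)^2/2\, dx\). The sketch below follows that assumption and reuses m4 from the earlier sketch; it is illustrative, not the authors' implementation.

```python
import numpy as np

def f1_3d(msf_a, msf_b, lo, hi, steps=10_000):
    """F1_3D between two membership functions (Eq. 4), assuming the depth of the
    3D object grows linearly from 0 at confidence 0 to 1 at confidence 1, so that
    Volume(A) is the integral of A(x)^2 / 2 over x."""
    xs = np.linspace(lo, hi, steps)
    a = np.array([msf_a(x) for x in xs])
    b = np.array([msf_b(x) for x in xs])
    common = np.trapz(np.minimum(a, b) ** 2 / 2, xs)
    return 2 * common / (np.trapz(a ** 2 / 2, xs) + np.trapz(b ** 2 / 2, xs))

# Two trapezoids whose mismatch sits near the top: F1 here is about 0.93,
# while F1_3D is lower (about 0.89) because top differences are weighted more.
A = lambda x: m4(x, 0, 4, 8, 12)
B = lambda x: m4(x, 0, 5, 7, 12)
print(f1_3d(A, B, 0, 12))
```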
Fig. 14 Contrasting F1 and \(\hbox {\textit{F}1}_{3\mathrm{D}}\) scores used to calculate the similarity between membership functions

We used the following normalisation models to compare the results in English and Portuguese: (a) the Log(A) regression models to compare MV expressions; (b) the mean models to compare IV expressions; and (c) the Lin(A) regression models to compare PR expressions. Table 14 shows the F1 and \(\hbox {\textit{F}1}_{3\mathrm{D}}\)-scores between English and Portuguese. We observe \(F1 > F1_{3\mathrm{D}}\) for all three types of imprecise temporal expressions analysed, indicating that the differences tend to be concentrated closer to the top of the MSFs, where the confidence is higher and the differences can be considered more relevant.

Table 14 F1 and \(\hbox {\textit{F}1}_{3\mathrm{D}}\)-scores between Portuguese and English

6 Conclusions

We have presented an analysis of previously unstudied imprecise time expressions (timexes) in text. This analysis helps to address the overall problem of dealing with temporal expressions in information extraction. Our work introduces three novel techniques for this analysis. First, we provide a novel classification of imprecise timexes. Second, we develop a novel methodology to obtain membership functions for timexes, based on human interpretation of imprecise timexes. Third, alongside the usual F1-score for evaluation, we introduce a novel metric for identifying the differences between membership functions along three dimensions—the \(\hbox {\textit{F}1}_{3\mathrm{D}}\). Our models were applied both to English and, for the first time, to Portuguese expressions.

The resulting models give an insight into the way in which imprecise expressions are interpreted in different languages. For example, the linear regression-Log(A) membership function that defines the expression “less than 90 days” in Portuguese includes possible interpretations—albeit at a low level of confidence—of 91 to 95 days. This leads us to believe that temporal imprecision is not reasoned about purely mathematically and that there is a level of uncertainty that can cross the boundary limits defined by the numerical values found within the temporal expressions.

In future work, we plan to perform experiments to obtain normalisation models for the other types of imprecise expressions (PP, RV, and GE), and to examine whether the differences between languages can be influenced by the knowledge domain or by cultural differences. We also plan to further examine the relation between the F1 and \(\hbox {\textit{F}1}_{3\mathrm{D}}\) scores and compare their interpretability against other probability distribution divergence metrics, such as the Kullback–Leibler (KL) divergence. Additionally, we plan to compare the membership function models against other probabilistic representations (e.g. Gaussian or gamma distributions) and validate to what extent such probabilistic generalisations are able to mimic the results we found in this work.

Up to 35% of temporal expressions may be imprecise in some domains. By normalising these imprecise expressions, we can greatly increase the number of extracted events connected to a timeline. We plan to perform search-based experiments over the events extracted from medical records, in order to provide an extrinsic evaluation of the impact of dealing with such imprecise temporal data on the overall IE process.