1 Introduction

Numerous memory organizations now provide access to online record archives spanning decades or even centuries. The Times Digital Archive and Chronicling America are two such collections, among many others, that contain millions of news reports from past decades. While in many cases documents are collected through scanning and optical character recognition, newly created born-digital documents are increasingly being incorporated into online document archives [1]. This ongoing evolution of digital document archives enables present and future users to gain knowledge about significant historical events by searching and exploring primary sources. News stories are a particularly valuable and appealing kind of preserved document, since they report on the major events and issues of society in the past. News article archives enable readers to obtain comprehensive event-oriented information from primary sources, which can serve a variety of purposes, including verifying the authenticity of materials created by secondary sources (i.e., fact-checking at the source) or learning forgotten details about past events.

While text indexing and query suggestion methods have been investigated in the context of temporal search [2,3,4,5,6,7], relatively little research has been conducted on archival document access approaches such as document ranking and retrieval. The prevalent technique is to employ the same access strategies as those used for synchronic text collections. In this context, we argue that since archives have distinctive temporal properties, successful document retrieval methods must differ from typical information retrieval solutions built for synchronic document collections. Our study thus seeks to provide relevant indicators for enhancing information access to document archives and, as a result, the exploitation of our cultural heritage.

When professional users of archived materials, such as historians or archivists, conduct searches or otherwise interact with archives, they often have a specific goal in mind. In other words, they frequently have rather specific search objectives. Additionally, these users are typically adept at locating essential archived documents. On the other hand, common users may have less defined search intents, frequently desiring to sample only a few documents without a specific search goal in mind [8], or they may simply seek informative or entertaining content. For these users, novel and engaging retrieval technologies are required in order to surface interesting and relevant search results. Rather than a direct keyword match, which in the context of archival retrieval is mainly suitable when the search intent is precise and relatively narrow, a novel notion of relevance is needed. Considering the specific characteristics of archival search, we believe that material that is to some degree related to the present time can be relevant and appealing to current users who search the archive. We consider the relevance of retrieved results to concepts and entities known to searchers as natural and desirable, since users are naturally drawn to familiar content.

In this context, we propose the idea of contemporary relevance for archival records, which refers to their assessed relationship to the present times. The traditional notion of relevance in synchronic document collections is usually reflected by a match between query terms and document content. However, in diachronic document collections such as long-term news archives, even if the returned archived documents are related to the user query (e.g., a query such as “politics,” “soccer,” or “commerce”), they may not match the context and conditions relevant to modern searchers. Archival records are inherently “far” from the present, the more so the longer ago they were published. We believe that the degree to which materials relate to current events or contexts should be examined in order to deliver attractive and informative search results to non-professional users. Returning documents with a certain connection to the present time should assist users in locating potentially interesting material and may even promote accidental discovery [9].

Furthermore, contemporary relevance used as a search criterion may increase the probability of providing useful results when users have particular tasks in mind. For instance, a writer working on a story may be more interested in historical records that are directly tied to the current twist of the story than in those that have only a tenuous connection to it.

Notably, contemporary relevance is a well-established notion in the science of history, where it is used to describe and evaluate the relevance of historical events and individuals to modern times [10, 11]. This notion is particularly important in history didactics, since it helps students grasp the relationship between history and contemporary society and so may increase their interest in studying history.

In Fig. 1, we provide two examples of news stories from the late 1980s to illustrate how signals of contemporary relevance may appear in news articles. While both articles deal with economic matters, the bottom item also mentions “Donald Trump.” Comparing these two texts, we can conclude that the bottom document has a greater chance of being considered relevant to contemporary times, given that Donald Trump relatively recently served as the US President and continues to be a significant figure in the modern world. Figure 2 illustrates two further document samples, this time from 2007. Both are about “France,” yet the bottom news excerpt discusses citizen demonstrations, while the top piece discusses TGV train speed testing. The latter (bottom) text would presumably be deemed more current, given that citizen protests in France have recently become fairly widespread and violent, as well as often covered by various news outlets globally. When issuing general queries like “economic issues” or “France,” users would more likely be interested in the bottom documents in both figures, as these are more pertinent to the current times. Note that these two sets of articles illustrate only a few of the conceivable manifestations of contemporary relevance in historical records.

We show a final pair of examples of archived documents, from the mid-1990s, in Fig. 3. These documents are returned from the archive when a user queries “United States trade issue.” Both articles discuss the trade issue between the U.S. and China, yet the bottom one mentions the leaders of both countries at that time (Bill Clinton, Jiang Zemin). Comparing these two documents, we can say that the bottom document is likely more relevant to present users, since it concerns famous historical figures who are still strongly remembered today.

Fig. 1

Excerpt news articles from 1988 (top; available at https://www.nytimes.com/1988/04/12/business/credit-markets-neiman-shifts-key-executives.html) and 1987 (bottom; available at https://www.nytimes.com/1987/01/06/nyregion/anti-peddler-drive-pleases-fifth-ave-merchants.html)

Fig. 2

Excerpt news articles from 2007 (top; available at https://www.nytimes.com/2007/04/03/world/europe/03iht-train.4.5130569.html) and 2007 (bottom; available at https://www.nytimes.com/2007/05/07/world/europe/07iht-protest.4.5603711.html)

Fig. 3

Excerpt news articles from 1996 (top: https://www.nytimes.com/1996/07/17/business/pacific-officials-fail-to-reach-a-consensus.html, bottom: https://www.nytimes.com/1996/11/13/business/us-to-spur-beijing-on-trade-group-entry.html)

1.1 Proposed approach and contributions

Manually establishing connections and determining the significance of historical records to the present is of course not feasible. Our focus is thus on automatically measuring this type of temporal connection, which we see as a unique indicator of document usefulness in addition to the more classic notion of keyword-based relevance. How, then, can one select those historical records that pertain to the contemporary time? We believe there is no single answer to this question. Rather, contemporary relevance is a multifaceted and complex concept that is difficult to represent with a single signal or approach. We therefore suggest and employ a set of different yet intuitive features that are likely to quantify the degree of document correspondence to the current times. When constructing the features for document representation, we pay special attention to named entities and event mentions in archived documents, which are crucial in the news article genre, and we use an external knowledge base, namely Wikipedia, to determine their relevance to the present. Additionally, we assess the documents’ content similarity to a reference collection of recent news articles, and we extract and anchor the temporal expressions embedded in the articles’ contents. Furthermore, because the semantics of terms used in the past may differ from their present meaning, we apply a semantic transformation approach to effectively compare and align the meaning of terms across time.

We then train a learning to rank model using crowd-sourced annotated data. A problem remains, however, that gathering annotated data for training is very expensive, especially considering that many diverse queries are needed to construct a reliable dataset. We therefore propose and evaluate a weak supervision strategy to improve ranking effectiveness and relieve the burden of providing costly annotations. This approach alleviates the need for hand-labeled datasets and can be applied to any language and time period. Automatically assigned weak labels are employed under the assumption that older content is less likely to be relevant to searchers than newer content. This assumption may not hold for every query or document, but when sufficiently long time periods separate the contents, such an approach should work, provided large amounts of data are employed. In the experimental section we demonstrate that while these automatic annotations are imperfect, they can nonetheless be used to train a strong predictive model.
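The weak labeling idea can be sketched in a few lines. The cutoff years and the decision to skip a middle "buffer" period are illustrative assumptions of ours, not values taken from the paper:

```python
def weak_label(pub_year, old_cutoff=1990, recent_cutoff=2010):
    """Assign a weak relevance label from the publication year alone,
    presuming documents from the distant past to be less contemporarily
    relevant than recent ones. Cutoffs are illustrative assumptions.
    """
    if pub_year <= old_cutoff:
        return 0      # distant past: presumed less relevant to the present
    if pub_year >= recent_cutoff:
        return 1      # near past: presumed more relevant to the present
    return None       # middle period: skipped to keep the labels well separated
```

Skipping the middle period is one simple way to keep the two weakly labeled classes temporally well separated, which is what makes the noisy assumption usable at scale.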

Finally, the last contribution of this paper is the development of a working prototype that uses a subset of the proposed features in an unsupervised manner. The prototype runs on top of the Portuguese Web Archive and ranks news articles in relation to user queries (query-based ranking) as well as to the query-conditioned notion of contemporary relevance. We discuss the technical choices made to set up such a working online demo system and present some selected results.

To summarize, our study makes the following contributions:

  1. We define and discuss contemporary relevance in relation to archived documents. By introducing the concept of document relatedness to the searcher’s time into archive search techniques, we aim to improve current ways of accessing historical document collections.

  2. To accomplish this goal, we develop a learning to rank model that incorporates a variety of novel yet intuitive features.

  3. We propose and implement a concept of weak supervision in which documents from the near past are automatically presumed to be more valuable than those from the distant past.

  4. We successfully test our approach on annotated news articles and provide a deeper discussion of the characteristics of contemporary relevance and its application in archival search.

  5. Finally, we demonstrate a working prototype of a reranking system deployed on top of the Portuguese Web Archive. We discuss implementation details and show that the concept of contemporary relevance can be implemented in archival retrieval for any major language, without any training data.

This paper is an extended version of our earlier paper presented at the JCDL 2021 conference. In the current paper, we provide additional experimentation, extend the discussion of the proposed concept of contemporary relevance, and demonstrate a prototype proof-of-concept system.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 describes the proposed features and their implementation. In Sect. 4, we outline the experimental design. Section 5 summarizes the evaluation findings for the proposed approach and describes our method’s two-stage learning process. The next section describes the developed prototype of the reranking system. Finally, in Sect. 8, we summarize the study and discuss our future goals.

2 Related work

In this section, we explain how estimating contemporary relevance differs from previous research. We also relate our proposal to the concepts of newsworthiness, archival informatics, and digital history, and we review common learning to rank models.

2.1 Topic detection and tracking

Topic Detection and Tracking (TDT) started as an early initiative to investigate the state of the art in detecting and following new events in a stream of broadcast news stories [12, 13]. The main difference from our task is that archival documents are not necessarily part of any longer ongoing story that is still developing. Hence, TDT and other related, more recent initiatives based on story linking (e.g., the background linking task [14] in the TREC News track) are not applicable. Furthermore, in the contemporary relevance estimation task, our focus is specifically on past articles rather than recent articles, such as those collected from online news streams.

2.2 Temporal information retrieval

A fundamental problem in information retrieval is to estimate the relevance score of a document from an underlying document collection, given a user query [15]. Traditionally, various ways to measure the semantic match between query and document have been established, including BM25 and TF-IDF. More recently, neural ranking models [16] have become common in modern information retrieval.

There has recently been growing interest in extracting temporal information from the content of documents as well as from queries, for the purpose of exploiting it as a search context. Such temporal signals, from both queries and documents, have been progressively incorporated into the retrieval process, resulting in the development of Temporal IR (T-IR), which strives to improve information retrieval by integrating document relevance and temporal relevance [17, 18]. Numerous approaches have already been proposed for ranking documents based on their temporal characteristics [2, 17,18,19,20,21,22,23].

Li and Croft [24] established a time-based language model that prioritizes recent texts based on their timestamp information. Metzler et al. [25] suggested a technique for analyzing query frequency distributions over time in order to deduce implicit temporal information about search queries and use this information to rank results. Campos et al. [26] defined a similarity measure based on co-occurrences of words and years in corpora, together with a classification algorithm capable of identifying the set of most relevant dates for an implicit time-sensitive query, thereby improving the efficacy of the ranked results. Arikan et al. [27] developed a temporal retrieval model that incorporates temporal expressions from document content into query-likelihood language modeling. Berberich et al. [28] developed a comparable model, but with the addition of uncertainty in temporal expressions. Kanhabua and Nørvåg [29] developed three distinct approaches for determining the implicit temporal ranges of queries and used temporal information to enhance retrieval efficacy through document reranking. Campos et al. [30] associated appropriate temporal phrases with implicit temporal queries using temporal signals retrieved from document contents.

Our research is also related to work on recency-based search and assessing the freshness of documents [31, 32], as well as work on recognizing significant time periods in historical archives using diversification approaches [33]. However, in archive collections one typically does not need to find recent or fresh records. Diversifying search results according to their chronological distribution, as advocated by Singh et al. [33] and Berberich and Bedathur [34], will therefore not alleviate the problem of locating texts of contemporary relevance.

Exploiting temporality in archived documents can lead to a better understanding of old documents and better performance of search engines. Several recent research works address archival search. Zhang et al. proposed methods to suggest corresponding entities across time as effective queries in archival search [35]. Holzmann introduced an indexing method focusing on anchor texts in a knowledge base [2]. News retrieval is essentially focused on the publication dates of news articles, and old news articles are usually not considered relevant, or are at best deemed partly relevant.

Research on estimating the focus time of documents [36, 37] is also related to this work. The focus time of a document is defined as the time period to which the document’s content refers. For example, a document about World War II would have a focus time of [1939, 1945]. Note that focus time is orthogonal to the publication time (or timestamp), which is the time when a document was created or published (although the two are often quite correlated, especially in news articles). Focus time could be used in our proposal such that past documents referring to the future (especially the time when the user issues the query) could be returned. However, this would be only one particular way of estimating a document’s relation to the present times, while in the current work we discuss several other signals, including mentions of currently popular entities or similar events. Furthermore, the notion of focus time is typically more applicable to the relation of documents to past periods (e.g., documents referring to historical events) than to the correspondence of past documents to the present times (e.g., documents referring to events that lie in the future relative to their timestamps). Lastly, unlike the concept of focus time, contemporary relevance centers on the notion of relevance within the realm of Information Retrieval, rather than solely on the independent task of temporal positioning of texts.

2.3 Obtaining access to archival records

As a result of broad digitization initiatives, new fields of interdisciplinary research have started to develop. Digital history [38] and archival informatics [39] are two such innovative fields. However, the majority of present work in these domains focuses on digitizing, annotating, organizing, and analyzing material, as well as on offering data processing and search approaches, in the absence of significant advancements in effective retrieval models. In general, memory institutions such as archives or libraries appear to lack a deeper understanding of how to construct successful and appealing services for users [40], and their user engagement techniques are frequently deemed inefficient [41].

This circumstance, we feel, is one of the reasons why the utilization of digital archive collections remains considerably below what would be expected. More sophisticated search methods that go beyond simple keyword matching are needed to attract and engage people. Some initial steps have already been taken to improve access to document archives. Berberich et al. [4] presented time-travel text search across a versioned document collection in order to efficiently index and rank relevant documents based on the collection’s state at various points in time. Tran et al. [3] developed a time-aware re-contextualization system that automatically adds complementary material from Wikipedia to a sentence in an archived document in order to aid readers in comprehending content from the past. Pasquali et al. [42] developed a temporal summarization approach for querying the Portuguese Web Archive (arquivo.pt), with the goal of assisting users in making sense of the progression of a specific topic through time. Zhang et al. [43] and Duan et al. [44] advocated mapping historical entities to their contemporary counterparts based on their descriptions in news archive collections. Finally, Jatowt et al. [45] proposed searching for potentially intriguing information in news archives by simulating the sense of surprise elicited by historical content in contemporary readers. The basic concept is that content from the past that is distinct from current content, or is unexpected, may be appealing to contemporary readers.

While linking historical data has been recognized as a critical component of increasing the value of archive information [7, 46], the concept of automatically linking it to the present has yet to be realized. Thus, our idea is novel within the recent synergy of history science and informatics; it is an initiative centered on the demands of an end user, the typical consumer of historical knowledge. Establishing the relationship between historical material and contemporary events should significantly boost the value and use of historical data, and may result in increased user interaction with heritage document collections and more effective history teaching.

2.4 Newsworthiness

In journalism and media studies, a significant debate has been ongoing about which events should be chosen and conveyed to readers as news reports. Newsworthiness is the degree to which a piece of news is worth publication, as determined by a certain set of values. Galtung and Ruge presented 12 elements of newsworthiness in their theory of news selection [47]: (1) frequency: the duration of the event, (2) threshold: the event’s effect, (3) unambiguity: the event’s degree of ambiguity, (4) meaningfulness: the event’s importance in terms of cultural closeness, etc. Additionally, various attempts have been made to determine the newsworthiness of items by automated content analysis. For instance, Di Buono and Snajder [48] examined relationships between newsworthiness and headline linguistic characteristics. De Nies et al. [49] developed a framework for analyzing an article’s content from six angles, including similarity analysis, named entity identification, and topic detection. Unlike the preceding techniques, which assign time-agnostic scores to documents, we investigate incorporating the concept of newsworthiness into archives and quantifying documents’ relevance to the user’s temporal context.

2.5 Learning to rank

In recent years, more and more machine learning techniques have been used to train ranking models. Learning to rank algorithms learn the optimal way of combining features extracted from query-document pairs through discriminative training. Liu defines “Learning to Rank” as ranking methods that use discriminative models with feature vectors of documents as input [50]. Feature vectors are designed to reflect the relevance of the documents to the user’s information need. Typical features used in learning to rank include the frequencies of the query terms in the document, the Okapi BM25 score, and the Doc2Vec representation of the document.

Liu groups the learning to rank algorithms into three categories of approaches: the pointwise approach, the pairwise approach, and the listwise approach [50]. The pointwise approach takes a feature vector of every single document as input and outputs the relevance degree of the documents. Its ranking model includes regression-based algorithms, classification-based algorithms and ordinal regression-based algorithms. The pairwise approach takes a pair of any two documents as input and outputs the relative order between the documents. Many pairwise ranking algorithms have been proposed, based on neural networks (e.g. RankNet [51]), position-based weights (e.g. LambdaRank [52]), and support vector machines (e.g. Ranking SVM [53, 54]). The listwise approach takes an entire set of documents associated with each query and outputs the relevance degree using the permutation set. The listwise approach can be divided into two sub-categories from the perspective of the loss function. In the first sub-category, the loss function is explicitly related to evaluation measures. The ranking model then optimizes a continuous and differentiable approximation of the measure-based ranking error (e.g. Approximate Rank [55] and SmoothRank [56]), a continuous and differentiable bound of the measure-based ranking error (e.g. \(\text {SVM}^\text {map}\) [57]) or non-smooth objectives (e.g., AdaRank [58]). In the second sub-category, the loss function is not explicitly related to the evaluation measures and instead, the ranking model minimizes the inconsistency between the output and the ground truth permutation (e.g., ListNet [59] and ListMLE [60]).
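To make the pairwise approach above concrete, the following minimal sketch trains a linear scoring function with a RankSVM-style hinge update. It is an illustration of the general idea, not an implementation of any of the cited systems; the function and parameter names are ours:

```python
import numpy as np

def pairwise_rank_train(pairs, dim, lr=0.1, epochs=20):
    """Minimal pairwise learning-to-rank sketch (hinge loss, as in
    Ranking SVM). `pairs` is a list of (x_pos, x_neg) feature-vector
    pairs where x_pos should be ranked above x_neg. Returns a weight
    vector w; documents are then ranked by the score w . x."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            margin = w @ (x_pos - x_neg)
            if margin < 1.0:            # hinge: update only on violated pairs
                w += lr * (x_pos - x_neg)
    return w
```

The pointwise and listwise approaches differ only in the unit the loss is computed over (a single document, or a whole ranked list per query), while the feature vectors stay the same.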

3 Proposed approach

Figure 4 summarizes our approach. We rank archived documents according to a variety of features that are likely to aid in determining their relationship to the present. This is accomplished by the use of an external knowledge source such as Wikipedia and a reference collection of recent news articles. We then apply learning to rank models for ranking archive contents using a range of proposed features.

Fig. 4
figure 4

Overview of the proposed approach for determining contemporary relevance of archival documents (temporal expression related features are grouped together with the named entity related features)

The following sections detail the proposed features, as well as the intuitions underpinning their selection. The features fall into three categories: content-related, entity-related, and event-related features. All features are normalized using the min-max method.
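Min-max normalization simply rescales each feature into [0, 1] over the document set, as in the sketch below (the constant-feature fallback is our own choice):

```python
def min_max_normalize(values):
    """Rescale a list of feature values into [0, 1] (min-max method)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```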

3.1 Content-related features

3.1.1 Congruence with current news

We begin by introducing a measure of similarity between a target historical document and a sample collection of modern documents (referred to as \(D^{ref}\)), which is used to represent contemporary news articles. The purpose of this feature is to compare historical news items with popular contemporary news stories. We compare semantics using word2vec embeddings, together with a transformation across vector spaces.

Vector space translation is required because of term semantic variation over time, most notably over long time periods. Even though two words from different time periods may be semantically similar, they may actually have little overlap in terms of their contexts (as indicated by their surrounding terms). For example, as demonstrated by Tahmasebi et al. [61] on a large news collection, while it is reasonable to compare iPod to Walkman in the past (e.g., both are portable music devices), the set of top co-occurring words with iPod in documents published in the 2010s bears little resemblance to the set of top co-occurring words with Walkman extracted from documents published in the 1980s (only 1 of the 10 top co-occurring non-stopword terms is the same in this case). To compute the transformation between the two distinct vector spaces, we employ an orthogonal transformation with a weak supervision technique described by Zhang et al. [5] and also used in [7].

We now explain how to obtain the orthogonal transformation \(W^{*}\) between two temporally distant periods (\(T^{D}\) and \(T^{ref}\)) of the news articles dataset, where, in our case, one represents news articles from some period in the past \(T^{D}\) and the other represents the collection of news articles in the present time \(T^{ref}\). First, we learn term representations in each of the two periods using word2vec. Naturally, since these are separate learning processes, the features in the two resulting vector spaces have no direct correspondence to each other. We will, however, establish a connection between these two semantic spaces. Note that, unlike some methods designed for learning temporal word embeddings [62], the across-time transformation we use does not require the existence of intermediary data between the target time periods (which may not always be available). This is beneficial especially when the two target periods are quite distant from each other (e.g., the 1970s and the 2010s). For example, in [62] the dynamic word vectors are learned simultaneously for all time units by assuming that the term vectors of any two adjacent time units are similar.

More formally, let the transformation matrix W map words from a vector space \(\theta _{A}\) into another vector space \(\theta _{B}\), and let the transformation matrix Q map words in \(\theta _{B}\) back into \(\theta _{A}\). Let a and b be normalized word vectors from the news document collections underlying \(\theta _{A}\) and \(\theta _{B}\), respectively. The correspondence between words a and b can be evaluated as the similarity between vectors b and Wa, i.e., \(Corr(a,b) = b^{T}Wa\). However, we could also form this correspondence as \(Corr'(a,b) = a^{T}Qb\). This has also been reported in [63, 64] for the purpose of bilingual text translation. Note that orthogonal transformation preserves vector norms, so given normalized vectors a and b, \(Corr(a,b) = b^{T}Wa = |b||Wa|cos(b, Wa) = cos(b, Wa) = Sim_{cosine}(a,b)\). Orthogonal transformation has also been found to perform better than linear transformation in bilingual translation [65]. The challenge, however, is that training term pairs for learning the mapping W are difficult to obtain. We adopt the solution proposed by [5] for preparing the training data. Namely, we use Shared Frequent Terms (SFTs) as the training term pairs. SFTs are terms that are very frequent in both document collections (e.g., frequent English terms like water, sky, man). Such frequent terms tend to undergo semantic drift only to a small extent across time. The phenomenon that words intensively used in everyday life evolve more slowly than less frequent terms has been empirically tested and confirmed in several languages including English, Spanish, Russian, and Greek [66,67,68]. Using only the shared frequent terms for learning the transformation matrix results in better performance than using all the terms as in [69].
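The SFT selection itself can be sketched as follows. This is a simplified illustration under our own assumptions (tokenized input, a shared-vocabulary intersection of the top fraction by frequency); real preprocessing and thresholds would differ:

```python
from collections import Counter

def shared_frequent_terms(past_tokens, present_tokens, top_frac=0.05):
    """Pick Shared Frequent Terms (SFTs): terms that rank in the top
    `top_frac` of the vocabulary by frequency in BOTH collections.
    These serve as weakly supervised training pairs for the
    cross-time transformation."""
    def top_terms(tokens):
        counts = Counter(tokens)
        k = max(1, int(len(counts) * top_frac))
        return {term for term, _ in counts.most_common(k)}
    return top_terms(past_tokens) & top_terms(present_tokens)
```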

Given L pairs of normalized SFT vectors trained on the two document collections [(\(a_{1}\), \(b_{1}\)), (\(a_{2}\), \(b_{2}\)),..., (\(a_{L}\), \(b_{L}\))], we learn the transformation W by maximizing the accumulated cosine similarity of the SFT pairs (we utilize the top 5% of Shared Frequent Terms, about 18k words, to train the transformation matrix):

$$\begin{aligned} \underset{W}{max}\ \sum _{i = 1}^{L}b_{i}^{T}Wa_{i},\,s.t.\,W^{T}W = I \end{aligned}$$

To infer the orthogonal transformation W from pairs of SFTs \(\{a_{i}, b_{i}\}_{i = 1}^{L}\), we state the following theorem.

Theorem 1

Let A and B denote two matrices, such that the \(i^{th}\) row of (AB) corresponds to the pair of vectors (\(a_{i}^{T}\), \(b_{i}^{T}\)). By computing the SVD of \(M = A^{T}B = U\Sigma V^{T}\), the optimized transformation matrix \(W^{*}\) satisfies

$$\begin{aligned} W^{*} = U\cdot V^{T} \end{aligned}$$

The obtained orthogonal transformation \(W^{*}\) allows learning correspondences at the level of terms. Based on it, we measure the similarity between a vectorized term \(v_{A}\) in \(\theta _{A}\) and a vectorized term \(v_{B}\) in \(\theta _{B}\) as follows:

$$\begin{aligned} Comp(v_{A}, v_{B}) = Sim_{cosine}(W^{*}\cdot v_{A}, v_{B}) \end{aligned}$$

The similarity between a target document d and the reference collection \(D^{ref}\) is computed in our case as follows. We infer the orthogonal transformation \(W^{*}\) from the training term pairs \(\{a_{i}, b_{i}\}_{i = 1}^{L}\), which are obtained from D and \(D^{ref}\). A feature vector \(v^{ref}\) of the reference collection \(D^{ref}\) is computed by averaging word vectors after removing stopwords. Likewise, for each document d in D, we compute a feature vector \(v_d\) by calculating the vector of each word in d and averaging them. Finally, the similarity between d and \(D^{ref}\) is:

$$\begin{aligned} Sim(v_{d}, v^{ref}) = Sim_{cosine}(W^{*}\cdot v_{d}, v^{ref}) \end{aligned}$$

where \(v^{ref}\) is the feature vector of the reference collection \(D^{ref}\) computed by averaging word embeddings after removing stopwords, \(W^{*}\) is the learned transformation matrix, and \(v_d\) is obtained by calculating and averaging the vectors of the words in d.Footnote 7
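Learning \(W^{*}\) is an instance of the orthogonal Procrustes problem, which can be sketched in a few lines of NumPy. This is a toy illustration rather than the authors' implementation; note that whether the maximizer is written \(UV^{T}\) or \(VU^{T}\) depends on the row/column convention chosen for M.

```python
import numpy as np

def learn_orthogonal_map(A, B):
    # A, B: (L, dim) arrays; row i holds the normalized training
    # vectors a_i (source space) and b_i (target space).
    # Maximize sum_i b_i^T W a_i subject to W^T W = I via the SVD
    # of the cross-correlation matrix (orthogonal Procrustes).
    M = A.T @ B                      # (dim, dim), equals sum_i a_i b_i^T
    U, s, Vt = np.linalg.svd(M)      # M = U diag(s) Vt
    W = Vt.T @ U.T                   # orthogonal maximizer of tr(W M)
    return W, s

def comp(v_a, v_b, W):
    # Comp(v_A, v_B): cosine similarity of W v_A and v_B.
    u = W @ v_a
    return float(u @ v_b / (np.linalg.norm(u) * np.linalg.norm(v_b) + 1e-12))
```

At the optimum, the objective equals the sum of singular values of M; if the two spaces are related by an exact rotation, the sketch recovers it.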

3.1.2 Temporal expressions

We believe that if a text contains temporal expressions referring to times in or near the present, it has a high probability of being related to the present. Archived documents may also contain projections, forecasts, or predictions about the future (defined as the time period after the documents’ publication dates), which might be interesting to users. We therefore extract and normalize the temporal expressions included in each document. This is accomplished through the use of the HeidelTime temporal taggerFootnote 8 [71]. Each detected temporal expression te is represented by a single year \(t_{te}\), which, in case te spans a period of multiple years, denotes the mid-year of this period. Then, using the time-decaying function in Eq. 5, we calculate the score \(\tau _{te}\) based on the distance between \(t_{te}\) and the start of \(T^{pre}\), denoted as \(t^{pre}_{s}\). \(T^{pre} = [t^{pre}_{s}, t^{pre}_{e}]\) is the current reference time period, i.e., it defines the present times, and its end-point \(t^{pre}_{e}\) denotes the time of query issuing.

$$\begin{aligned} \tau _{te} = \alpha ^{-\lambda (t^{pre}_{s} - t_{te})} \end{aligned}$$

Finally, we take the average and maximum values of \(\tau _{te}\) over the set of all temporal expressions present in a target text as two document content features for learning to rank.Footnote 9
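As an illustration, the two temporal-expression features could be computed as follows (a minimal sketch; the \(\alpha\) and \(\lambda\) values follow the settings reported in Sect. 4.3, and the variable names are ours):

```python
def temporal_features(te_years, t_pre_start, alpha=0.5, lam=-0.125):
    # tau_te = alpha^{-lambda * (t_pre_start - t_te)} for each detected
    # temporal expression, pooled with average and max.
    taus = [alpha ** (-lam * (t_pre_start - t)) for t in te_years]
    return sum(taus) / len(taus), max(taus)
```

A year at the start of \(T^{pre}\) scores 1.0, while years further in the past decay toward 0.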

3.2 Entity-related features

The following feature group is concerned with named entities. Named entities play a significant role in news and should be given special focus. Multiple entities may appear in a single news piece, and entities that serve only passing roles should not be considered. We therefore select and extract significant entities in the content of each document and compute the entity-related features on them. We use three methods to determine the most significant entities in a document: frequency, TextRank, and offset. Frequency highlights key entities that appear often in the document. TextRank [72] identifies significant sentences that have a high likelihood of capturing the substance of a document, and the entities associated with the highest-scored sentences are judged salient. Offset is used to extract entities that occur in the opening sections of news stories, such as the title or lead paragraph, as these are frequently indicative of the article’s subject. We next select the n (\(n=5\) by default) most significant entities as determined by each of the aforementioned methods in order to compute the entity-related features, which include (1) the current popularity of entities; (2) the activity period of entities; and (3) connectedness to existing entities. We utilize Wikipedia as our knowledge base when determining these features. Note that in prior research Wikipedia was found to be a useful resource for assessing global collective memory attention [73,74,75], and we think it can be equally beneficial for our objective of finding contemporary relevant archived documents.

Finally, we aggregate all the feature values associated with the selected key entities using max and average pooling to get the final collection of entity-related features for each document. The sections below describe our approach to computing the above-mentioned features of entities.

3.2.1 Current popularity of entities

We believe that when the named entities representative of a particular text are not popular at present, a reader is less likely to consider the document content to be contemporary relevant. On the other hand, popular entities (e.g., Donald Trump mentioned in the examples in Sect. 1) should be more relevant and attractive to users. We assess the popularity of entities by utilizing Wikipedia page view statistics via the Wikimedia REST API.Footnote 10 The named entities are found using the TextRazor tool.Footnote 11 We extract the page view count for each entity from the statistics of its Wikipedia article for the last three years. We then compute the popularity \(popularity(e_{i})\) of entity \(e_{i}\) as:

$$\begin{aligned} popularity(e_{i}) = \ln {pageview(e_{i})} \end{aligned}$$

Note that the popularity of entities is only one notion of how important or authoritative they are for the current users. It is however a convenient way of measuring what entities from the past matter at the current time.

3.2.2 Activity period of entities

When an article contains entities that are no longer active (particularly those that ceased to be active a long time ago), a reader is less likely to see the content as contemporary relevant. Based on this idea, we estimate the activity periods of entities. That is, we extract the time periods during which each entity was active from its DBpedia Linked Data representation. To begin, we gather all properties with a date data type (e.g., xsd : date, xsd : dateTime, xsd : gYearFootnote 12). Then, we extract temporal information linked with predicates that define time intervals (for example, birthDate / deathDate for persons, and foundingYear / dissolutionYear for organizations). We utilize the DBpedia SPARQL endpointFootnote 13 for this. We then assess the activity period of each entity in the entity set \(E_{d} = \{e_{1}, e_{2},\ldots , e_{l}\}\) mentioned in a target document as \(AP_{e_{i}} = [t_{s}, t_{e}]\). The degree to which an entity is relevant to the present, represented by \(r_{e_{i}}\), is then computed using the distance between the entity’s activity period and the current reference period \(T^{pre}\):

$$\begin{aligned} r_{e_{i}} = \begin{cases} 1 &{} \text {if } AP_{e_{i}} \cap T^{pre}\ne \emptyset \\ \alpha ^{-\lambda (t^{pre}_{s} - t_{e})} &{} \text {if } AP_{e_{i}} \cap T^{pre} = \emptyset \end{cases} \end{aligned}$$

where \(t^{pre}_{s}\) is the starting time point of \(T^{pre}\).
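The piecewise relevance score can be sketched as follows (illustrative code under the assumption that activity periods are year intervals, with the same decay settings as above):

```python
def activity_relevance(ap, t_pre, alpha=0.5, lam=-0.125):
    # ap = (t_s, t_e): the entity's activity period;
    # t_pre = (t_pre_s, t_pre_e): the current reference period T_pre.
    # r_e = 1 if the periods overlap, otherwise a score decaying with
    # the gap between the activity end t_e and the start of T_pre.
    (t_s, t_e), (t_pre_s, t_pre_e) = ap, t_pre
    if t_e >= t_pre_s and t_s <= t_pre_e:   # AP intersects T_pre
        return 1.0
    return alpha ** (-lam * (t_pre_s - t_e))
```

An entity still active within \(T^{pre}\) scores 1; one that ceased activity long ago scores close to 0.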

3.2.3 Connectedness to existing entities

When an entity has few or no links to entities that are active in the present, it should have a lower probability of being regarded as relevant to the present. On the other hand, an entity that is connected to a large number of currently active or valid entities is likely to be deemed contemporary relevant. We create the entity graph \(\varvec{G(V, E)}\), where \(\varvec{V}\) is the set of nodes representing entities in the archived documents as well as entities related to them in the knowledge base, and \(\varvec{E}\) is the collection of edges reflecting connections between the members of \(\varvec{V}\). Using the DBpedia Page Links dataset,Footnote 14 we construct the graph \(\varvec{G}\). The connectivity of entities to the present is then determined using the biased random walk given in Eq. 8. \(\varvec{R}\) is a vector holding node relatedness \(r_{v_{i}}\), \(\varvec{M}\) is an aperiodic transition matrix, \(\alpha \) is a decay factor (= 0.85), and \(\varvec{d}\) is a static score distribution vector summing to one and bound to the distances of entity activity periods from the present.

$$\begin{aligned} \varvec{R} = (1 - \alpha ) \varvec{M} \times \varvec{R} + \alpha \varvec{d} \end{aligned}$$


$$\begin{aligned} d_{i} = r_{v_{i}}\Big /\sum _{j}{r_{v_{j}}}, \quad s.t.\, \sum _{i}{d_{i}}= 1 \end{aligned}$$
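A possible power-iteration sketch of this biased random walk, assuming a column-stochastic transition matrix \(\varvec{M}\) (a toy illustration, not the authors' implementation):

```python
import numpy as np

def biased_random_walk(M, r, alpha=0.85, iters=200):
    # Iterates R = (1 - alpha) * M R + alpha * d to a fixed point,
    # where d holds the activity-period relatedness scores r_v
    # normalized so that sum_i d_i = 1.
    d = np.asarray(r, dtype=float)
    d = d / d.sum()
    R = np.full(len(d), 1.0 / len(d))
    for _ in range(iters):
        R = (1 - alpha) * (M @ R) + alpha * d
    return R
```

With a column-stochastic M, the score mass stays normalized, and nodes whose activity periods are closer to the present (higher \(r_{v_{i}}\)) accumulate higher scores.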

3.3 Event-related features

Apart from entities, events are another significant and helpful indicator of contemporary relevance. We propose that when a document contains references to events that are comparable to current events, a reader is more likely to regard the text to be contemporary relevant.

We extract and arrange event data from each document (i.e., who did what to whom). In our approach, event data is assumed to be composed of three codes: source actor, target actor, and action. To identify event references, we employ the widely used Conflict and Mediation Event Observations (CAMEO) event coding system [76]. The code components are grouped according to state actors, sub-state actor roles, regions, and ethnic groupings. We retrieve the following coded event vector for each event mention: source actorState, source actorRole, target actorState, target actorRole, actionCode. For instance, the line “Obama administration approaching decision on enhancing Iraqi training” generates the coded event vector (USA, GOV, IRQ, None, 01).

We next compute the distance between event vectors \(\varvec{a}\) and \(\varvec{b}\) by using Hamming distance. The distance is represented as

$$\begin{aligned} dist(\varvec{a}, \varvec{b}) = m - \sum ^{m}_{i=1}{\delta (a_i, b_i)} \end{aligned}$$

where m is the size of \(\varvec{a}\) and \(\varvec{b}\) (\(m=5\) in our case) and \(\delta \) is the Kronecker delta, equal to 1 when \(a_i = b_i\) and 0 otherwise.

The distances between all possible pairs of events in D and in \(D^{ref}\) are then determined. Finally, for each document \(d \in D\), we take the average and minimum distance values as features.
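The event-distance features can be sketched as follows (illustrative code; event vectors are 5-tuples as in the example above):

```python
def event_distance(a, b):
    # Hamming distance: m minus the number of matching components
    # (delta(a_i, b_i) = 1 when the components agree).
    assert len(a) == len(b)
    return sum(1 for x, y in zip(a, b) if x != y)

def event_features(doc_events, ref_events):
    # Average and minimum distance between a document's events and
    # all the events of the reference collection.
    dists = [event_distance(a, b) for a in doc_events for b in ref_events]
    return sum(dists) / len(dists), min(dists)
```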

One of the disadvantages of CAMEO coding is its strong emphasis on occurrences involving pairs of actors. Natural disasters, for example, may thus be underrepresented. Nonetheless, considering its simplicity, we adopt this approach in the current implementation.

3.4 Learning to rank based on weak supervision

We next train the model using the features given in the preceding sections to learn the contemporary relevance scores of documents. To collect sufficiently large training data, we rely on weak supervision (a concept illustrated graphically in Fig. 5). We employ a pairwise model in which older documents are automatically assigned a lower score than more recent documents. This is based on the notion that more recent texts are, on average, more relevant to current times than records from much earlier periods.Footnote 15 While this is not always the case, it should remain true for a large number of news articles, especially if the archived documents being compared were published in time periods separated by significantly long time gaps. Intuitively, the recent past is more significant and better recalled than the distant past. This was also demonstrated using large-scale news collections from which past-referential temporal expressions were identified and normalized, indicating a greater forgetting of more distant years than recent ones (with a form comparable to an exponential function) [77].

We can then create a large weakly supervised training set without the need for costly manual annotation using this simple automatic labeling technique. To identify which text is more relevant to the present, we employ a pairwise strategy to train the learning to rank model. Each learning sample \(s = (\varvec{f_d}, r_d)\) consists of a feature vector \(\varvec{f_d}\) describing a document d and its assigned relatedness score \(r_d\). We assign the weak labels automatically and learn the ranking function using pairwise logistic loss.
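The automatic labeling step can be sketched as follows (our illustration; the era boundaries follow Fig. 5):

```python
def weak_training_pairs(docs, old_era=(1987, 1991), new_era=(2003, 2007)):
    # docs: list of (features, publication_year). A document from the
    # newer era is weakly labeled as more contemporary relevant than
    # any document from the older era, yielding preference pairs
    # (preferred_features, other_features) for pairwise training.
    old = [f for f, y in docs if old_era[0] <= y <= old_era[1]]
    new = [f for f, y in docs if new_era[0] <= y <= new_era[1]]
    return [(n, o) for n in new for o in old]
```

Documents outside both eras (e.g., from the held-out gap period) produce no training pairs.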

Fig. 5

The concept of weak supervision for our Learning to Rank model, where documents from the distant past (those released shortly after the collection’s start date and colored in blue) are deemed to be less relevant to the present than more recent documents (published soon before the end date of the dataset and colored in orange). We thus assume throughout the experiments that archived documents released in [2003, 2007] are more current than those published in [1987, 1991]. The documents from 1996 to 1998 are used for testing

In the experiments, we utilize the model trained using the weak supervision technique described above to score news articles whose publication dates do not coincide with those of the archived documents used for training. In particular, we train the model on documents from two distinct eras separated by a substantial time gap (12 years, see Fig. 5) to imply their distinct relationship strengths with the present. The model is then evaluated using archived documents from the middle of the gap period that separates the two eras (i.e., from [1996, 1998]). The next section gives more information about the experimental settings.

4 Experimental setup

4.1 Document collection

As the underlying historical document collection, we employ the New York Times Annotated Corpus (NYT) [78], which we have indexed using the Solr search engine. The collection comprises more than 1.8 million articles published between January 1987 and June 2007 and has been extensively used in research on Temporal Information Retrieval [17, 18]. To conduct the tests, we partition the dataset into three sub-collections, relying on 860k news articles published from 1987 to 1991, from 1996 to 1998, and from 2003 to 2007. Articles published from 1987 to 1991 and from 2003 to 2007 (a total of 650k) are utilized for training based on the weak supervision technique outlined above, while those published between 1996 and 1998 (a total of 210k) are used as the document collection D for the assessment. We used the TextRazor APIFootnote 16 to extract and disambiguate named entities from each document in order to compute the entity-related features. The API returns a Wikipedia article with a computed confidence score for each discovered entity. We kept entities with a confidence score greater than 0.18, which was experimentally found to perform optimally. Additionally, we crawled 92k articles from the New York Times online archive site that were published in the last three yearsFootnote 17 to serve as the foundation for our reference collection \(D^{ref}\). We created four distinct versions of \(D^{ref}\) based on four distinct representations of \(T^{pre}\): the past month, the last six months, the last year, and the last three years. Finally, we retrieved event data from each article to compute event-related features using the pre-trained machine-coding tool PETRARCH.Footnote 18

We utilized Figure Eight,Footnote 19 a crowd-sourcing platform (currently called Appen) specializing in machine learning annotation, to create the ground-truth dataset. We began by collecting the descriptor fields of the New York Times Annotated Corpus in order to retrieve news articles for annotation. We used 42 of these descriptors as broad queries to collect news articles spanning a variety of domains and topics (e.g., art, elections, finance, sports, international relations, military forces, and AIDS). We then chose the top 50 documents for each query based on their Okapi BM25 scores. BM25 is defined as follows. Given a query q containing terms \(t_1,\ldots , t_M\), the BM25 score of a document d is computed as

$$\begin{aligned} BM25(d, q)=\sum _{i=1}^{M} \frac{IDF\left( t_{i}\right) \cdot TF\left( t_{i}, d\right) \cdot \left( k_{1}+1\right) }{TF\left( t_{i}, d\right) +k_{1} \cdot \left( 1-b+b \cdot \frac{LEN(d)}{avdl}\right) } \end{aligned}$$

where TF(t, d) is the term frequency of t in document d, LEN(d) is the length (number of words) of d, and avdl is the average document length in the text collection from which documents are drawn. \(k_1\) and b are free parameters, and IDF(t) is the IDF weight of t, computed as follows:

$$\begin{aligned} IDF(t)=\log \frac{N}{n(t)} \end{aligned}$$

where N is the total number of documents in the collection and n(t) is the number of documents containing t.
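For reference, a self-contained BM25 sketch; the free parameters are set to the common defaults \(k_1 = 1.2\) and \(b = 0.75\), which the text leaves unspecified:

```python
import math

def bm25(query_terms, doc, docs, k1=1.2, b=0.75):
    # doc: list of terms of the scored document; docs: the collection
    # (lists of terms), used for N, n(t), and the average length avdl.
    N = len(docs)
    avdl = sum(len(d) for d in docs) / N
    score = 0.0
    for t in set(query_terms):
        n_t = sum(1 for d in docs if t in d)
        if n_t == 0:
            continue                       # term absent from the collection
        idf = math.log(N / n_t)
        tf = doc.count(t)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avdl))
    return score
```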

We asked annotators to rate the returned documents on a scale of 1 to 4 for their relevance to the contemporary time as follows:

  • 1: not relevant

  • 2: weakly relevant

  • 3: moderately relevant

  • 4: strongly relevant

We hired a total of 277 annotators (3 per article). To ensure high-quality annotations, we authorized only people with the Type 3 contributor level, which denotes the group of individuals who tend to do the highest-quality work. Because many of the articles in the NYT collection concern events in North America, we admitted only workers from that region. We also screened annotators using sixteen test questions, enabling them to carry out evaluations only if they met a stringent accuracy level, and we limited the maximum number of annotations produced by a single annotator to eliminate the possibility of biased annotation. In total, we gathered 2,080 documents for evaluation and collected 6,240 annotations. Figure 6 shows the distribution of labels.

Fig. 6

The distribution of answers

4.2 Queries

As mentioned before, queries were taken from the descriptor fields of the New York Times Annotated Corpus. We randomly selected 42 queries out of the total of 1,411 descriptors for conducting the evaluation; they are listed in Table 1. We used the descriptor fields, which constitute relatively broad queries, as it was reported that non-expert users tend to issue queries on general topics rather than on specific ones when searching in document archives [79,80,81].

Table 1 The list of queries used in the evaluation

4.3 Other settings

We constructed the graph \(\varvec{G}\) using the DBpedia Page Links dataset.Footnote 20 We used the neural network-based pairwise Learning to Rank approach [82] implemented in TensorFlow RankingFootnote 21 to compute the ranking. As ranking metrics, we used nDCG with \(n = 1, 3, 5, 10\), and we used pairwise logistic loss as the loss function. Using the trained model, we ranked archived documents from the test collection according to their contemporary relevance. We set \(\lambda =-0.125\) and \(\alpha =0.5\) following [3] to compute the temporal similarity feature and the activity period, respectively. To compute event-related features, we set \(m = 5\), since we use the upper two components of the CAMEO code to maintain a high degree of generality for event types.

4.4 Evaluation metrics and methods

We used precision at ranks 1, 3, 5, and 10 (P@1, P@3, P@5, and P@10), recall at ranks 1, 3, 5, and 10 (R@1, R@3, R@5, and R@10), as well as mean average precision (MAP) at rank 10 as assessment measures. We compared our approach with the following methods:

  • randomized ranking (Random),

  • Okapi BM25 ranking (BM25),

  • TensorFlow Ranking (TF-R) [82] with word2vec as input,

  • TF-R with Doc2Vec as input.

TensorFlow Ranking (TF-R) [82] is an open-source library for developing scalable, neural learning to rank (LTR) models. We used the default ranking model.

The word2vec model was trained on the New York Times Annotated Corpus. By averaging the vectors of each word (excluding stopwords), we created a 300-dimensional feature vector for each document. The Doc2Vec model was trained on the same corpus, likewise producing 300-dimensional document vectors.

5 Results of experiments

5.1 Selecting reference document collection

To begin, we need to optimize the duration of the current time period \(T^{pre}\), as we do not know the optimal time span to utilize as the current reference time. As previously stated, we define the last month, the last six months, the last year, and the last three years as four possibilities for \(T^{pre}\), and then calculate all the features for each of these spans. The performance of the models trained using these time periods is shown in Tables 2 and 3. As can be seen, the strategy using the last year’s data performs best. We therefore utilize this time period as the reference time period \(T^{pre}\) in the remainder of the analysis.

Table 2 Ranking performance in precision and MAP measures for different lengths of the reference time period
Table 3 Ranking performances in recall measure of different reference time periods

5.2 Performance of ranking

Next, we examine the performance of various ranking algorithms and assess the efficacy of the presented approaches. The performance of each approach is summarized in Table 4. We can see that our proposed technique outperforms the other methods. For example, it outperforms BM25 by up to 20%, with the highest difference for P@5. The low performance of BM25 may result from the general and broad queries used. Since the highly ranked documents under BM25 may contain more general terms, we can assume that users may find contemporary relevance in something other than general terms. In comparison to TF-R using word2vec, our technique achieves even higher improvements, with the greatest difference for P@10.

The poor performance of TF-R implies that the similarity of content without vector space transformation is an inadequate estimate of contemporary relevance. Finally, we see that our suggested strategy also outperforms the others in terms of MAP.

Overall, we conclude that utilizing information on entities and events, as well as the transformation-based similarity of content, is effective for estimating the contemporary relevance of a document. With all the features combined, our proposal achieves the best results on almost every evaluation metric.

Table 4 Results for different ranking methods

5.3 The contribution of feature groups

We next undertake an ablation study to compare performance when a particular feature group (Entity, Event, or Content) is removed. As shown in Tables 5 and 6, removing entity-related features resulted in the greatest loss of precision and recall, showing that this feature category is the most important contributor to the ranking performance. Content-related features appear to be rather significant as well, followed by event-related features. Nonetheless, we may infer that all of the feature groups contribute to the method’s overall efficacy.

Table 5 Results with a given feature group removed

5.4 The efficacy of weak supervision

Following that, we examine the success of training using our proposed concept of weak supervision and investigate how it might be used effectively together with the annotated data. We investigate the following approaches:

5.4.1 Training with weak supervision only (weak supervision)

This is the same model as previously described and tested. The manually annotated data is not used in the training step (it is used only for testing).

5.4.2 Training using fivefold cross-validation on manually annotated data (strong supervision)

We separated the manually annotated data into five groups and trained the model solely on this data using fivefold cross-validation.

5.4.3 Combining results from the models trained on the weakly annotated and on the manually annotated data by list merging (merged weak + strong supervision)

We integrated strong and weak supervision by averaging the prediction scores from the two approaches, after first normalizing each score to fall within the range of 0 to 1.

Table 6 Results of recall with a given feature removed

5.4.4 Training on weakly supervised data after filtering it using a classifier trained on manually annotated data (enhanced weak supervision)

We first automatically annotated the unlabeled documents using an SVM classifier trained on the manually annotated data in order to filter out poor-quality document pairings. We deleted high-scored articles from the collection’s older section and low-scored documents from the collection’s newer section. The objective was to increase the effectiveness of the weak supervision approach by pre-filtering its data. We filtered out 60% of the documents using a threshold level determined in validation experiments. Following this filtering, we used the remaining texts for training in the same manner as with Weak Supervision.

Table 7 demonstrates that utilizing the idea of weak supervision to collect a significant quantity of training data is beneficial: the proposed technique performs better than when trained exclusively on manually annotated data (i.e., Strong Supervision). This is most likely due to the small amount of ground truth data. Additionally, the table demonstrates that our proposed method of combining weakly supervised and manually annotated data via the above-described filtering process (i.e., the Enhanced Weak Supervision method) produces the best results, outperforming the simple list merging method (i.e., the Merged Weak + Strong Supervision method). Overall, the results imply that using unlabeled data makes sense and that automatically annotating it after a filtering phase, using a supporting classifier trained on small but high-quality data, can improve the results further.

Table 7 Results for different models of training with our method

6 Discussions

In this section, we provide additional discussion.

In the current work, we used relatively broad queries for experiments. This is because previous studies have shown that users tend to often issue broad and short queries in archival search [80]. When a query is more specialized and includes, for example, the name of an entity or an event, further analysis of the latent search intent of the user may need to be done and the model may need some extension.

Furthermore, our proposal is essentially query independent, since such an approach is a natural focus for the first research step. To leverage query information, further investigation of model selection is necessary. For example, TensorFlow Ranking accepts query features, which are independent of the documents in the collection, alongside document features [82]. We would also need to examine the effect of query features on training with weak supervision. In the following section, we demonstrate a simple query-dependent unsupervised solution for contemporary-relevant document retrieval. The demonstrated approach and the working prototype also provide certain explainability-related characteristics, since users can learn what kind of features caused a given document to be ranked highly. This is important in order to remain transparent and to let users leave feedback on whether the considered features indeed matter in real search scenarios.

Finally, we should also consider the danger of creating a present-related echo chamber or filter bubble. If a user who frequently searches in document archives always received only documents that are contemporary relevant to her queries, there would be a certain risk of failing to present the true picture of the past. Essentially, users would “see” the past only through the “lens of the present”. This could be studied as well.

7 Working prototype

We have implemented the concept of contemporary relevance-based ranking on top of the online Web Archive portal arquivo.pt, which provides access to the Portuguese Web Archive. The Portuguese Web Archive (PWA) is the national Web archive of Portugal, with a mission to periodically archive contents of national interest available on the Web. The archive stores and preserves information of historical relevance for current and future generations. It is also a service of the Foundation for Science and Technology (FCT).

We have used the API of arquivo.pt to collect the top query-relevant documents and then re-rank them based on the computed degrees of contemporary relevance. The arquivo.pt archive is useful for this kind of research since full-text search access to the preserved contents is provided through the API. For now, we have decided to utilize only the most common newswire sources in Portugal in order to deal with news articles only and to eliminate potential noise.Footnote 22 Typical web pages would pose more complex challenges, since some content would be retained while other content would be removed or added over time.

We propose a lightweight unsupervised approach based on detecting named entities and scoring them for their current popularity, as well as comparing the content of target archived documents to recent news articles. The named entities are extracted from each search result using the Wikifier APIFootnote 23 [83]. With Wikifier, it is possible to jointly perform named entity recognition and named entity linking. Linking to the Wikipedia knowledge base is useful since we can later estimate entity popularity using page view logs. An important property returned in the JSON response by Wikifier is the entity’s PageRank score obtained by a random walk on the Wikipedia link graph. Based on these scores, it is possible to take the top n entities of the document (where n is a number defined by the user, ranging between 1 and 50, with a default value of 25) for the popularity computation. For each of the top n named entities, a score is then computed representing the average Wikipedia views of the target entity. To get the number of page views for each named entity, we utilize the Wikimedia API.Footnote 24 This API has one specific endpoint just for querying Wikipedia articles’ view counts. It is possible to obtain the number of accesses to the Wikipedia article of a particular entity in a given language (in our case, Portuguese). In order to obtain better results and avoid potential bias caused by highly popular entities, we normalize the returned counts.

The second component necessary for ranking the returned search results is the similarity to recent news. For the computation of this similarity, it is necessary to first have access to the recent news related to the user query (the query is provided by a user as an input to the system). To gather a sufficient number of recent news articles in Portuguese to be utilized as a reference news collection, we harness the Mediacloud APIFootnote 25 and collect titles of articles published in Portuguese over the recent year. We use a time frame of a single year, counting from the query issuing time, because a year was found to be an optimal time scope. Since dealing with the full texts of news articles would be too expensive, only the titles of the relevant news are collected and combined into a single virtual document. The similarity computation is based on the vectorial representation of a target document returned from arquivo.pt, whose rank needs to be estimated, and the obtained virtual document, which represents news relevant to the user query published over the last year. Vectorization is done with the Huggingface SentenceTransformer libraryFootnote 26 using a multilingual embeddings model.Footnote 27 After the vectors are obtained, we compare them using the cosine similarity measure.

Finally, we combine the score representing the popularity of entities mentioned in each search result with the document’s similarity to the recent news using a linear combination with the mixing parameter \(\alpha \). \(\alpha = 0\) denotes the case when only the entity popularity score is used, while the value of 1 means that only the news-based similarity is utilized for document ranking. All the fetched arquivo.pt search results are then sorted by their final scores in descending order.
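The final combination is a plain linear mixture, sketched below with illustrative values (both component scores are assumed to be already normalized to [0, 1]):

```python
def final_score(entity_popularity, news_similarity, alpha):
    # alpha = 0: rank by entity popularity only;
    # alpha = 1: rank by similarity to recent news only.
    return (1 - alpha) * entity_popularity + alpha * news_similarity
```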

The user interface is simple with only one search box and a side panel prepared to adjust some query parameters (see Fig. 7). The parameters are:

  • the maximum number of documents to be retrieved

  • the year since when data from arquivo.pt should be collected

  • the number of named entities to retrieve from each document

  • the alpha value used for the linear combination

Setting the alpha value empowers a user to decide whether similarity to recent news, popularity of the included entities, or a mixture of both determines the documents’ ranks.

The user interface is built with the StreamlitFootnote 28 application framework, a convenient and simple Python framework focused on displaying scientific data, which works well for lightweight solutions such as our user interface. In Fig. 7 we show the GUI and sample results returned by our system. The results are shown in the order computed by the system and consist of the title, the publication date, and the news source in which each returned document was published. We also include a link to the archive, a short text snippet, and an expandable box giving an overview of the top named entities of each document. On the left-hand side panel, the user can set the parameters related to the query, which is entered in the text field in the center of the screen. As mentioned before, by setting the alpha value the user decides whether entity popularity (small alpha) or similarity to present events (large alpha) dominates the document re-ranking. Note that \(\alpha \) values range from 0 to 1.

The proof-of-concept system presented above is a lightweight version of our proposed methodology for computing the contemporary relevance of archival documents. Although simple and somewhat limited in its functionality and in the range of contemporary-relevance aspects it covers, it is easy to set up and, unlike our main approach described throughout this paper, requires no training data. In the future, however, we plan to deploy the supervised approach on arquivo.pt with all the features outlined in the proposed methodology.

Fig. 7: A snapshot of the GUI and sample results returned by the built prototype application

8 Conclusions and future research

The purpose of this article is to discuss the relationship of archival records to the present and an effective method for assessing it.

As the first contribution, we present a unique research task for determining the contemporary relevance of news articles. Contemporary relevance is proposed as a novel criterion for determining the effectiveness of search within large temporal document collections. We believe that archival information that is not just related to the query in the conventional keyword-matching sense, but also has some relevance to current events and circumstances, can be appealing to users and has a strong probability of being seen as highly useful. Thus, our proposal can be regarded as a step toward a more effective and citizen-friendly utilization of our cultural heritage resources. Additionally, our study aims to spark discussion about the most effective methods for accessing heritage data, especially given the continual growth of present-day archival document collections. Unlocking the potential that rests in digital archives should also help justify the substantial funds currently being spent on digitizing historical documents and constructing archival collections.

The second contribution is a novel technique for determining the present relevance of historical content using Learning to Rank and a variety of specialized features. In its design, we estimate the relatedness of named entities to the present using an external knowledge base, and we analyze content similarity enhanced with orthogonal transformation and the scopes of temporal expressions. The experimental assessment reveals that our proposed method is effective. Additionally, we demonstrate that training with minimal supervision is beneficial for document ranking.

The proposed approach can be utilized in conjunction with current methods for document archive access. We believe that this type of search enhancement can also result in serendipitous discovery of useful or interesting content, for example, by allowing previously unknown archived documents to rise to the top of ranked search results in response to current events or changes in the popularity or activity of the embedded entities. Furthermore, the proposed method could sometimes even mitigate the problem of empty result sets (e.g., when a user query contains entities or objects that did not exist in the past; in that case the search engine could still retrieve some useful content).

Next, we demonstrate a working prototype as a proof of concept, which uses selected features to represent and calculate contemporary relevance in an unsupervised way. The objective of presenting it and discussing the implementation details is to showcase a simple approach that can be used for contemporary-relevance ranking without the need for expensive training data.

Our work has several limitations. In the design of our approach, we made several assumptions about the features that matter for contemporary relevance. Although, in our opinion, they are intuitive, these features could be more thoroughly analyzed and tested in subsequent research, for example, through user-focused surveys. Similarly, future studies should measure the interest of archive users in documents with high contemporary relevance in order to assess the level of potential demand for archival search approaches enhanced with contemporary relevance ranking. In the current work, we mainly focused on introducing the concept of contemporary relevance of documents and describing an approach for quantifying it, under the assumption that the notion of contemporary relevance is useful for typical searchers who access document archives. Furthermore, the task is difficult and requires data annotation, which is quite costly; thus we have used only a single dataset for the analysis. Finally, recently popular neural ranking approaches [84] should be tested on the proposed task to see whether they yield better results.

Finally, our analysis identifies numerous promising areas for future investigation. For example, in the future, we intend to develop effective explanations for why highly ranked articles are deemed relevant to the present, thus providing users with more context for the returned results. Additionally, we will examine the relationship between contemporary relevance and the classic concept of relevance, as well as evaluate, or even remove, some of the assumptions established in this study. Suggesting queries that lead to the discovery of archived documents with high contemporary relevance is another promising research direction. Moreover, we plan to conduct experiments on other document collections that span larger time frames. We will also investigate actual user satisfaction with the returned results as measured by a user survey consisting of the assessment of search result quality. Ultimately, as also hinted earlier, the idea that contemporarily relevant results are indeed useful and interesting to searchers would need to be tested in practical scenarios, although we believe it is natural and intuitive that past documents relevant to the present should be especially interesting for average users searching in archival document collections.