1 Introduction

Web archives act as a historical record of the web. The Internet Archive (IA) possesses the largest number of web archive holdings. These holdings are accessible through a set of interfaces to the Wayback Machine. Beyond IA, other web archives exhibit focused collection efforts, often providing unique captures within IA’s temporal and spatial (i.e., URL [8]) voids [29]. A common usage pattern in accessing IA’s captures is to visit the archive’s web site at archive.org, submit a URL of interest in a text input field, then select a date and time from the set of available captures for that URL. This pattern may differ between web archives’ respective web interfaces. Memento [39] provides a standards-based, interoperable means, dynamics, syntax, and semantics for representing identifiers for archival captures (mementos) from a set of web archives. Each archive that supports the Memento Framework provides an HTTP endpoint for retrieving mementos from its archival holdings. Users can send a request for all captures of a URL to a variety of supporting archives through a single endpoint by using a tool that queries multiple sources and combines their results: a Memento aggregator.

Fig. 1

An aggregator must be configured to supply parameters to an HTTP endpoint (like \(t_1\)), often exhibited in the form of a “templated URI” (\(t_0\)) for a URI-T as shown here. The suffixed red portion represents a URI-R http://example.com as used in practice. This URI templating is replicated (\(m_0\)) with URI-Ms (e.g., \(m_1\)), though a web archive need not identify its captures in this non-opaque manner (\(m_2\) and \(m_1\) identify the same memento)

Memento aggregators typically hold references to a set of endpoints at web archives that implement the Memento Framework. An aggregator may express each endpoint as a URI “template” like that in Fig. 1 or as a URI with an implicit append operation of a URI-R [39]. Upon receiving a request from a client with a parameterized URL (e.g., the URI-R applied to the template URI), an aggregator relays the argument received in this request as a parameter for subsequent requests to each archive. When the aggregator receives a sufficient response (footnote 1), as dictated by the logic of the aggregator in practice, the aggregator combines the results through a procedure that aligns with Memento syntax, often inclusive of temporal sorting (footnote 2). The aggregator returns this “aggregated” response to the client. This description somewhat encompasses the conventional role of the aggregator. Its place as a means for users to interface with multiple web archives through a single request has the potential to be further utilized and exploited and to become more generally useful.
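The relay-and-combine procedure described above can be sketched as follows. This is a minimal illustration, not the logic of any particular aggregator: the endpoint hostnames, the `{uri_r}` placeholder syntax, and the function names are assumptions, and canned lists stand in for the HTTP responses an aggregator would receive.

```python
from datetime import datetime

# Hypothetical templated URI-Ts; hostnames and placeholder syntax are assumptions.
TEMPLATES = [
    "https://archive-a.example/timemap/link/{uri_r}",
    "https://archive-b.example/timemap/link/{uri_r}",
]


def expand(template, uri_r):
    """Apply a URI-R to a templated URI-T by simple substitution."""
    return template.replace("{uri_r}", uri_r)


def merge(responses):
    """Combine per-archive lists of (datetime, URI-M) and sort them temporally."""
    combined = [entry for response in responses for entry in response]
    return sorted(combined, key=lambda entry: entry[0])


# One request URI per configured archive for the same URI-R.
uri_ts = [expand(t, "http://example.com") for t in TEMPLATES]

# Canned responses stand in for what each archive would return.
response_a = [(datetime(2015, 1, 1),
               "https://archive-a.example/20150101000000/http://example.com")]
response_b = [(datetime(2010, 6, 1),
               "https://archive-b.example/20100601000000/http://example.com")]

aggregated = merge([response_a, response_b])
```

Sorting after concatenation is the simplest choice here; an implementation may instead merge incrementally as responses arrive.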

This paper examines the hierarchical (yet decoupled) relationship between a Memento aggregator and Memento-compliant web archives. While an aggregator and a set of archives often exhibit a static one-to-many relationship (respectively), there exist both more fundamental and potentially more complex hierarchies that may be exhibited using existing infrastructure. Considering these additional hierarchies allows the role of the aggregator in use cases for web archives to be strategically and efficiently enhanced. We build on existing work in defining a framework for aggregating public and private web archives [25]. Our focus will be on identifying (Sect. 6) and mitigating (Sect. 7) some outstanding issues, both those introduced by the framework and those that exist in the current practice of interfacing with web archives using Memento aggregation. This paper constitutes an extension of our prior work [21] with additional discussion, evaluation through implementation, and treatment of further contemporary use cases that have since arisen in the realm of research using web archives.

2 Background

The Memento Framework [39] introduces the ability to perform temporal negotiation on the web by relating the current and past representations of a web page. Past representations are identified by “URI-Ms” and the original representation by a “URI-R”, per Memento. Memento also introduces a resource to associate URI-Ms and URI-Rs through a structured listing called a TimeMap, identified by a “URI-T.” A web archive may return a TimeMap representing its holdings, inclusive of URI-Ms, a URI-R, URI-Ts, and a URI-G for a “TimeGate.” A TimeGate allows a client, through HTTP request headers, to specify a datetime basis for a likewise included URI-R. This paper relates to the information retrieval and relational aspects of Memento TimeMaps and not specifically to the temporal negotiation of Memento, the latter being a feature of TimeGates. We focus on the association of past and present URIs and not the ability to resolve the closest datetime, both of which Memento provides.
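To make the relations in a TimeMap concrete, the following sketch groups the entries of a small Link-formatted TimeMap by role. The sample TimeMap, its hostnames, and the helper names are illustrative assumptions, and the regular expressions handle only the simple `<uri>; key="value"` shape shown here rather than the full link-format grammar.

```python
import re

# A tiny, hand-written Link-format TimeMap; archive.example is a placeholder host.
SAMPLE = (
    '<http://example.com>; rel="original",\n'
    '<https://archive.example/timegate/http://example.com>; rel="timegate",\n'
    '<https://archive.example/timemap/link/http://example.com>; rel="self"; '
    'type="application/link-format",\n'
    '<https://archive.example/20130101000000/http://example.com>; '
    'rel="first memento"; datetime="Tue, 01 Jan 2013 00:00:00 GMT",\n'
    '<https://archive.example/20200101000000/http://example.com>; '
    'rel="last memento"; datetime="Wed, 01 Jan 2020 00:00:00 GMT"\n'
)

# A target URI in angle brackets followed by zero or more ;key="value" attributes.
LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Group the URIs in a Link-format TimeMap by their Memento role."""
    groups = {"URI-R": [], "URI-G": [], "URI-T": [], "URI-M": []}
    for uri, attrs in LINK.findall(text):
        rels = dict(ATTR.findall(attrs)).get("rel", "").split()
        if "original" in rels:
            groups["URI-R"].append(uri)
        elif "timegate" in rels:
            groups["URI-G"].append(uri)
        elif "timemap" in rels or "self" in rels:
            groups["URI-T"].append(uri)
        elif "memento" in rels:
            groups["URI-M"].append(uri)
    return groups


groups = parse_timemap(SAMPLE)
```

Quoted attribute values are matched as a unit, so the commas inside `datetime` values do not split an entry.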

Fig. 2

“Time Travel” service provides a graphical, web-based endpoint to interface with LANL’s Memento aggregator. After submitting a URI and date range in the interface (a), the results are displayed (b), showing the extent of the captures from a variety of pre-configured, server-defined web archives

The concept of aggregation goes beyond the Memento specification by leveraging a similar structure to TimeMaps but allowing the URIs contained within the aggregated TimeMap to identify resources at multiple archives instead of a single archive. The Research Library at Los Alamos National Laboratory (LANL) deployed the original Memento aggregator [9, 18], currently accessible through a web interface via the Time Travel service at https://timetravel.mementoweb.org/. This web service (Fig. 2a) provides an HTML form field for a user to specify the URI-R and a datetime then uses temporal negotiation to query a set of archives and return links to the results (Fig. 2b).

A central point of access also implies a central point of failure: if the aggregator goes down, no further aggregation may be performed, and users must again resort to querying individual web archives. In response, Alam and Nelson created MemGator [1], a portable, open-source, cross-platform, user-deployable Memento aggregator. This tool enables individuals to no longer rely solely on a single web-accessible aggregator but to configure, use, and potentially deploy their own. Also, unlike with Time Travel, a user has the ability to control which web archives are queried for mementos. This newfound ability made the aggregation capability accessible for further exploration by researchers.

Memento is an extension to the Hypertext Transfer Protocol (HTTP). HTTP is a stateless, client–server based protocol on which the web is built. In the context of Memento, a client provides an HTTP request for a TimeMap of a URI in the past, often by appending a URI-R to a templated endpoint (Fig. 1). Both the identifiers for a TimeMap and a memento are returned with corresponding Link [32] HTTP response headers giving additional context to the representation (Fig. 3). A user (e.g., person) will typically act as a client through a user-agent (e.g., web browser, cURL (footnote 3)) and may send an HTTP request to a Memento aggregator with the expectation of receiving an HTTP response. The aggregator, in turn, acts as a client to the web archives, relaying the request for the URI-R in the past and expecting HTTP responses. This use case of a Memento aggregator playing the role of a server and a client is abridged in Sect. 7.4.

The behavior of a user requesting a TimeMap from a Memento aggregator and the subsequent similar request to web archives can be represented as a directed graph. The significance of this representation becomes more apparent when what is typically an endpoint from a web archive is itself an aggregator, which causes the graph to be extended. If the secondary aggregator were to request captures from the initial aggregator, a “cycle” would form and must be mitigated. In Sect. 6.1, we discuss this further, and in Sect. 7.2, we introduce some mitigation techniques for this potential scenario.

Fig. 3

A CDXJ TimeMap (top) represents the same content as a Link TimeMap (bottom) including the URI-R (http://matkelly.com, highlighted in red), URI-G (blue), other URI-Ts (green), and URI-Ms (brown) with identical relations (note similarity of the corresponding rel attributes) (color figure online)

Fig. 4

An aggregator is configured to query HTTP endpoints (a), which are typically from web archives but could equally point to other aggregators, causing an “aggregator chaining” effect (Sect. 4.3). Aggregators are agnostic of whether their requester is a client, script, or aggregator itself (b) and thus may send a request that ultimately resolves to a requester, causing an infinite loop

3 Related work

Most research involving Memento aggregation relates to usage of the aggregator rather than enhancement of the aggregation process. In the same way that prior to MemGator, researchers would state “we requested URIs from the Time Travel Service,” this statement was transformed to “we used MemGator to request URIs,” indicating that it was useful for researchers to utilize their own aggregator instance [3, 5, 10, 14, 15, 16, 17, 19, 23, 26, 33, 41, 42]. A facet of this use case is the ability for researchers to customize the set of web archives to be used as the basis for querying, which is performed prior to running MemGator by modifying a configuration file (footnote 4). This paper examines the aggregation process beyond accessing an aggregator and does so at a more abstract level than the ability to customize the archival sources.

3.1 Using aggregators beyond end-user aggregation

As MemGator is free and open-source software (cf. Time Travel), many research endeavors on evolving the aggregation process have centered around enhancing its development beyond the limited endpoint-based Time Travel ecosystem. While the set of archives to be aggregated is static, both in accessing the Time Travel service as well as a deployed MemGator instance, other standards-based mechanisms like HTTP Prefer [38] provide a means of allowing a client to specify the set of archives aggregated to an “enhanced” aggregator—in this case, an extended version of MemGator [22]. This approach [22] entailed encoding the set of archives that normally reside in a server-side configuration file to be customizable at query time. The specification of custom archival sources utilizes the “Prefer” HTTP request header with a value being the self-describing, base-64 encoded JSON representing the aggregator’s configuration of endpoints. A prototypical extension of MemGator referenced by the authors required the aggregator to read the HTTP request header and respond accordingly at runtime to request captures only from the archives specified by the client. This requirement of the aggregator being “enhanced” to this extended capability is discussed further in Sect. 8.1.
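A minimal sketch of how a set of archives might be carried in a Prefer request header and read back by an enhanced aggregator is given below. The preference name `archives`, the `data:` URI prefix, and the JSON field names are assumptions made for illustration; the source states only that the value is self-describing, base-64 encoded JSON.

```python
import base64
import json

# Assumed self-describing prefix for the encoded configuration.
PREFIX = "data:application/json;base64,"


def build_prefer(archives):
    """Client side: encode an archive list as a Prefer header value."""
    raw = json.dumps(archives).encode("utf-8")
    payload = base64.b64encode(raw).decode("ascii")
    return f'archives="{PREFIX}{payload}"'


def read_prefer(header_value):
    """Aggregator side: return the decoded archive list, or None if absent."""
    name, _, quoted = header_value.partition("=")
    if name.strip() != "archives":
        return None
    value = quoted.strip().strip('"')
    if not value.startswith(PREFIX):
        return None
    return json.loads(base64.b64decode(value[len(PREFIX):]))


# Hypothetical configuration entry; field names are illustrative only.
custom = [{"id": "archive-a",
           "timemap": "https://archive-a.example/timemap/link/"}]
header = build_prefer(custom)
decoded = read_prefer(header)
```

An aggregator that does not understand the preference can ignore the header and fall back to its server-side configuration.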

3.2 Graph abstractions

The process of HTTP requests as recursively applied through an aggregator subsequently querying additional sources resembles a graph structure, typically reduced to a tree in the conventional case (Sect. 4.2). As this work reiterates the potential for an aggregator querying an aggregator [25], the scenario arises of graph-style cycles (Fig. 4) that must be mitigated. Additionally, we may encounter redundancies in this “chaining” process (Fig. 6) where aggregators down the request chain are configured to query identical, previously queried archives with the same parameters. This problem resembles a singly linked list, wherein a child does not know the capacity of its parent, in adherence with HTTP being stateless. Here, an origin node is aware of the nodes to which it links, but a node is likely not aware of the other linkages from its parent, of which the node itself is one.

3.3 Aggregation optimization

The process of aggregation can be complex [31], both in the programmatic logic needed to accomplish it and, to a larger degree, in its temporal, spatial, and computational requirements. In conventional practice (Sect. 4.2), upon receiving a request, an aggregator will then send a request to each web archive, as defined by the endpoints in the aggregator’s configuration. The process of sending these requests can typically be performed asynchronously [1], as the response time from a particular archive may be affected by a variety of factors including its infrastructure capabilities, the quantity of its holdings, the temporal spread of its holdings, etc.

Different web archives inherently possess a different set of archival holdings (footnote 5). For example, an archive may only collect web pages within a limited set of domains under a ccTLD [34], like .ac.uk and .gov.uk for academic and government websites in the United Kingdom (respectively). Repeated requests for TimeMaps from web archives that consistently have no mementos for a structured type of URI produce inefficiencies that are exacerbated when aggregated and affect the aggregation process. AlSum et al. [6] generated profiles to identify the distribution of URIs across archives and the effect on recall by both including and excluding IA from the aggregated results. MementoMap [4] provided an approach to remedy this issue with the cooperation of a web archive. When an archive supplies indexes of its holdings, a “map” can be created to abstractly represent (using wildcards) the extent of the holdings for specific URI patterns. This may be abstracted to the level of TLD (e.g., the extent of the holdings within the .uk TLD) down to the specificity of the quantity of holdings within a specific path of the URI. MementoMaps also provide a format to represent this extent both on the level of URI-R and URI-M. Through the cooperation of one such scoped archive, the Portuguese Web Archive, Alam et al. [4] were able to demonstrate the increase in efficiency of selectively sending requests to a subset of archives informed by their respective holdings. This work leveraged MemGator. Aturban et al. [7], through a longitudinal study on the web archives themselves, identified the disappearance of the base URI of an archive, further highlighting the need for an aggregator to be updated to ensure resolution as archives change their hostnames.
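The idea of consulting a wildcard-based summary of holdings before sending a request can be sketched as follows. This is a deliberately simplified illustration and not the actual MementoMap format: the key syntax, the counts, and the function names are all assumptions.

```python
from urllib.parse import urlsplit


def surt_key(uri):
    """Build a simplified reversed-host key, e.g. 'uk,gov,www)/help'."""
    parts = urlsplit(uri)
    host = ",".join(reversed(parts.hostname.split(".")))
    return f"{host}){parts.path or '/'}"


# Hypothetical profile for one archive: wildcard pattern -> estimated captures.
PROFILE = {
    "pt,*": 5000,
    "uk,gov,*": 120,
    "uk,co,bbc)/news/*": 0,
}


def estimated_holdings(profile, uri):
    """Return the count for the longest matching pattern, or None if unknown."""
    key = surt_key(uri)
    best = None
    for pattern, count in profile.items():
        prefix = pattern.rstrip("*")
        if key.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, count)
    return best[1] if best else None


def should_query(profile, uri):
    """Skip an archive only when its profile explicitly reports no holdings."""
    count = estimated_holdings(profile, uri)
    return count is None or count > 0
```

Treating an unknown pattern as "query anyway" is a conservative choice that favors recall over efficiency; a different policy could skip archives for which nothing is known.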

In related work, Bornand et al. [9] consulted logs from the aggregator created by the Time Travel service (the authors are from LANL) to create classifiers to effectively route queries rather than relying on a web archive to provide a profile. They analyzed over 1.2 million URI-Rs from the aggregator’s cache (with over 239,000 URI-Ms) to identify a point-of-compromise for optimizing the requests sent to an archive based on the true and false positive rate as informed by prior requests. This work was extended by Klein et al. [27] to use Bloom filters for defining the extent of archives’ holdings for more efficient retrieval and previously identified response delays for their public-facing aggregator [28].

Part of this work entails enabling the user to have more extensive interaction with web archives using Memento. This is frequently enabled through the use of browser extensions [24, 37] and dedicated applications [20, 30, 40]. Mink (footnote 6) is an extension for the Chrome web browser that allows a user to extend the context of the web page they are currently viewing to be used as the basis of a request to a Memento aggregator. Some preliminary efforts have been made to provide further user control over archival selection from the web browser using the extension, but these have been neither formalized nor deployed in the primary extension. Doing so entails either requiring an enhanced aggregator (Sect. 8.1) that adapts its set of archives queried at runtime based on the user’s request (a server-side approach) or having Mink filter the results on the client after the aggregator returns them. In the latter, client-side approach, the logic of aggregation becomes the responsibility of the extension when an aggregator does not comply with sending requests to archives outside of its base configuration. In Sect. 8.2, we discuss some changes to Mink to realize the qualities of the aggregator and, in turn, become a purely client-side, browser-based Memento aggregator.

4 Base querying models

Per Sect. 3, Memento aggregators are often deployed as a web service. They are configured (in the case of MemGator, by specifying a list of archives, timeouts, etc.) and then “used” by querying the aggregator’s HTTP endpoints with the URI as a parameter. In this section, we define aggregator “querying models” for further discussion.

4.1 Proxy-style querying (S0)

An aggregator may be configured to query a single web archive. This is typically not exhibited because of redundancy (i.e., the user would normally just send the request to the archive directly), but serves as a base case for the querying models for further discussion. Here, the “aggregator” acts as a simple relay or proxy between the client and the web archive. This might be useful for specifying a configuration to the aggregator beyond what can be expressed with a request to a URI (footnote 7), e.g., timeouts for a response.

4.2 Conventional querying (S1)

Typical aggregator usage entails a client sending a request to an aggregator that then queries multiple web archives, aggregates the responses, and returns the aggregated response to the client (Fig. 5). The internal logic of the aggregator is not necessarily as relevant in defining this model but is critical for an aggregator’s operation. For example, an aggregator may pipeline the requests for more efficient querying. An aggregator also might require archives to respond within a time threshold and “short-circuit” the response to disregard archives that do not respond in time. The abbreviated set of results could then be aggregated based on the subset of archives that have responded up to that point in time. Some of these aspects are discussed further in Sect. 7.
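The time-threshold behavior described above can be sketched with concurrent requests and a shared deadline. This is an illustrative sketch under stated assumptions: `query_archive` simulates network latency with a sleep rather than issuing an HTTP request, and the source names, delays, and timeout are hypothetical.

```python
import asyncio


async def query_archive(name, delay, mementos):
    """Stand-in for an HTTP request to one archive; the delay mimics latency."""
    await asyncio.sleep(delay)
    return name, mementos


async def aggregate(sources, timeout):
    """Query all sources concurrently and keep only those answering in time."""
    tasks = [asyncio.create_task(query_archive(*source)) for source in sources]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()  # short-circuit: disregard archives that are too slow
    await asyncio.gather(*pending, return_exceptions=True)
    return dict(task.result() for task in done)


# Hypothetical sources: (name, simulated delay in seconds, canned URI-Ms).
SOURCES = [
    ("fast-archive", 0.01, ["m1", "m2"]),
    ("slow-archive", 5.0, ["m3"]),
]

results = asyncio.run(aggregate(SOURCES, timeout=0.2))
```

Here the slow source is dropped entirely; an implementation could instead record which sources timed out so that the client can tell an empty result from a missing one.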

Fig. 5

A typical use case for a Memento aggregator is for a user to specify a URL and receive a TimeMap representing a list of identifiers (URI-Ms) in the past—S1. Shown here is a Link [32] formatted aggregated TimeMap from MemGator containing a URI-R (line 3 in orange), URI-Ts (lines 4, 12–17 in green), URI-Ms (lines 6–11 in purple) and a URI-G (line 18 in blue) (color figure online)

4.3 Aggregator chaining (S2)

A Memento aggregator may successfully query any endpoint that is Memento compliant. The response from an aggregator is itself also typically Memento compliant. This begets the possibility that what is typically considered a “web archive” configured as an endpoint to query by an aggregator may be an aggregator itself, i.e., an aggregator querying an aggregator (Fig. 4a). One reason this is not typically exhibited is that the set of archives queried is (in practice) manually validated before being put in place in the configuration. In the case of the Time Travel service, there is no indication that an aggregator is queried by the basis aggregator handling the initial request. For MemGator, however, the set of endpoints is user-configurable, and thus this valid scenario may arise and has implications. The merits of “aggregator chaining” were discussed in the seminal work introducing the concept [25], but that work did not go into detail or highlight some problems that may occur. We reiterate and address these in Sect. 6.

As above, an aggregator may plausibly query a second aggregator. More fundamentally, and problematically, an aggregator can specify itself in its own definition of sources to query. This can be mitigated by the aforementioned manual validation, but a more scalable and programmatic approach might be accomplished through short-circuiting conditional logic in the querying function, i.e., preventing an aggregation web service from sending a request to itself and causing an infinite loop (Fig. 4b). Doing so in the self-referencing case is straightforward, but with the indirection introduced through aggregator chaining, an “aggregator-in-the-middle” prevents this logic from being enforced, as a request from a secondary aggregator would be handled as if from any other client. We discuss this problem further in Sect. 7.2.
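The short-circuiting check for the direct self-referencing case can be sketched as a filter over the configured endpoints. The function names and URIs are hypothetical, and the comparison (host, port, and path) is one assumed notion of endpoint equality; as noted above, this check does not catch a cycle that passes through an intermediate aggregator.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(uri):
    """Reduce an endpoint URI to a comparable (host, port, path) triple."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return host, port, parts.path.rstrip("/")


def without_self(own_endpoint, configured):
    """Drop any configured source that resolves to the aggregator itself."""
    own = normalize(own_endpoint)
    return [endpoint for endpoint in configured if normalize(endpoint) != own]


sources = without_self(
    "https://Agg.example:443/timemap/",
    ["https://agg.example/timemap", "https://archive.example/timemap"],
)
```

This presumes the service knows the URI at which it is reachable, which may not hold behind a proxy or when several hostnames map to the same instance.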

5 Core features

In this paper, we define approaches to extend the capability of the aggregator abstraction without regard to implementation. This brief but important section defines the empirical assumptions and expectations currently exhibited by an aggregator. These premises set forth the base expectations of an implementation. We build on these assumptions in Sect. 7.

Expectation 1:

An aggregator must treat the web requests it receives as coming from clients and must send its requests to archival sources agnostic of the dynamics of the receiver.

Expectation 2:

An aggregator must treat clients’ requests equally, regardless of whether a requestor is a user-agent, a script, or an aggregator itself.

Expectation 3:

An aggregator is unaware of whether its own configuration includes any sources also queried by its parent.

Expectation 4:

An aggregator must treat clients as stateless and return results from its queried sources.

6 Existing problematic scenarios

What might be deemed as “mis-”configuration of a Memento aggregator may only be exhibited and discoverable upon execution of a request for aggregation. Typical bases for including a web archive as an aggregation source are (1) the popularity of the archive itself to merit inclusion, (2) manual discovery by those responsible for configuring the aggregator, or (3) efforts toward publicity on the part of the archive itself to make those responsible for the aggregator aware of the archive’s existence and Memento compliance. There is no established process for an archive to declare the availability of its holdings in an effort to be included in a publicly accessible aggregator [35, 36]. Web archives with restricted holdings may be unsuitable to aggregate for reason of privacy of the holdings [25] or the requirement to limit accessibility beyond the conventional public scope. For example, the UK Web Archive requires a client to be physically on-site to access some of its holdings, otherwise returning an HTTP 451 (Unavailable For Legal Reasons) [11] status code.

Aggregators like the Time Travel service also supply TimeGate functionality, allowing for temporal negotiation (per Sect. 2), which is outside of this paper’s scope. As temporal negotiation requires an index for efficient selection (required for scale cf. query time indexing), an aggregator would need to retain the extent of the captures on a URI-R basis from their set of sources. As this is dynamic due to the availability of various archives’ web services, the non-static nature of the set of mementos in an archive, etc., a heuristic-based approach or some form of caching [9] might suffice for “good enough” temporal negotiation. For optimal precision of the representation of sources’ holdings, runtime querying of said sources’ respective indexes produces a more representative result. Thus, the abstraction of a TimeGate service being co-located with an aggregator would still succumb to the effects described in this Section. The remainder of this section describes three effects that can plague current aggregation instances: aggregation cycles (Sect. 6.1), self-reference (Sect. 6.2), and source redundancy (Sect. 6.3).

6.1 When a tree becomes a graph

As an extension of S2 in Sect. 4.3, an aggregator (A) requesting captures from a second aggregator (B) may cause a cycle if the latter aggregator is configured to query aggregator A. This can be mitigated using a few approaches, one of which we describe in Sect. 7.2. Figure 4b illustrates an abstract scenario where this might occur with user-configurable Memento aggregators.
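Viewing the configurations as a directed graph, the cycle described above can be found with a depth-first traversal. This sketch assumes global knowledge of every aggregator's configuration, which an individual aggregator does not have in practice (Expectation 3); it illustrates the graph abstraction only, and the node names are hypothetical.

```python
def find_cycle(graph, start):
    """Return one cycle reachable from `start` as a list of nodes, or None."""
    path, on_path, finished = [], set(), set()

    def visit(node):
        if node in on_path:
            # The node is already on the current request chain: a cycle exists.
            return path[path.index(node):] + [node]
        if node in finished:
            return None
        on_path.add(node)
        path.append(node)
        for source in graph.get(node, []):
            cycle = visit(source)
            if cycle:
                return cycle
        on_path.discard(node)
        path.pop()
        finished.add(node)
        return None

    return visit(start)


# Each key is an aggregator; each value is its configured list of sources.
CYCLIC = {
    "aggregator A": ["archive A", "aggregator B"],
    "aggregator B": ["archive C", "aggregator A"],
}
ACYCLIC = {
    "aggregator A": ["archive A", "aggregator B"],
    "aggregator B": ["archive C"],
}
```

Web archives appear only as leaves because they have no outgoing edges in this model.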

6.2 Self-reference

A simpler example of the abstraction where an aggregator, through the request chain, is requested to respond to a request that it initiated is exhibited in an aggregator’s own endpoints being within its configuration. A web service might be naive of the URI to which it is accessible, blindly sending responses after consuming and processing the parameters in the requests received. Likewise, the solution described in Sect. 7.2 would prevent this from occurring.

6.3 Duplication of sources

The combination of aggregators being user-configurable and the potential for aggregators to query aggregators may result in duplication of results. For example, in Fig. 6, aggregator A queries web archive A, web archive B, and aggregator B. Aggregator B queries web archive A, web archive C, and web archive D. It could be useful for the clients of aggregator A to obtain the results from aggregator B; for instance, aggregator B may be privy to the access-restricted web archives C and D. However, the results that aggregator B returns from web archive A will likely be redundant with those that aggregator A requests from web archive A directly. Thus, the results may need to be deduplicated. This characteristic may also exist outside of aggregator chaining. For instance, aggregators currently configured to request mementos from archive.org and archive-it.org (both hosted by the Internet Archive) will often receive URI-Ms from each archive with precisely the same 14-digit timestamp represented in the URI-M. While it is possible that the two services have unique captures (based on the tools used), determining this requires dereferencing the URI-Ms, which is out of the scope of this paper, which focuses on TimeMaps.
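The redundancy in this example can be illustrated with a set intersection over the two configurations and an order-preserving removal of identical URI-Ms. The archive hostnames and URI-Ms are placeholders, and only exact string matches are treated as duplicates; URI-Ms that merely share a timestamp across different archives are kept, since deciding whether they are the same capture would require dereferencing them.

```python
# Configured web archive sources of each aggregator, following the example above.
SOURCES_A = {"web archive A", "web archive B"}
SOURCES_B = {"web archive A", "web archive C", "web archive D"}

# Sources queried redundantly when aggregator A also queries aggregator B.
overlap = SOURCES_A & SOURCES_B


def deduplicate(uri_ms):
    """Remove exact duplicate URI-Ms while preserving first-seen order."""
    return list(dict.fromkeys(uri_ms))


# Hypothetical URI-Ms: the first list is A's direct results, the second is B's.
direct = [
    "https://archive-a.example/20150101000000/http://example.com",
    "https://archive-b.example/20160101000000/http://example.com",
]
via_b = [
    "https://archive-a.example/20150101000000/http://example.com",
    "https://archive-c.example/20170101000000/http://example.com",
]

merged = deduplicate(direct + via_b)
```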

Fig. 6

An aggregator (A) configured to request captures from a set of sources \(\{S\}\) inclusive of a second aggregator (B) can result in B redundantly querying one of A’s sources, i.e., \(| S_A \cap S_B | \ge 1\)

7 Newfound capabilities

In this paper, we emphasize the contribution of the untapped functional potential of a Memento aggregator beyond simple aggregation. Section 5 outlined the fundamental expectations of an aggregator that are exhibited and must be maintained as core functions. While the logic itself of strategically querying the set of archives with which an aggregator is configured has been explored in other works using profiles or machine-learning (Sect. 3.3), these do not consider the breadth of potential improvements like enabling the client to have further control of the aggregation beyond URI (e.g., using HTTP Prefer [22]), efficiency in returning partial results through HTTP endpoints, and mitigation of a non-curated set of archival sources.

7.1 User-defined set of archives

HTTP provides a standardized means [22] for enabling the end-user (one querying an aggregator through HTTP) to specify the archival sources for aggregation, namely the HTTP Prefer request header [38]. The value for this header may include an encoded, modified version of the JSON data that is typically used to configure MemGator, containing custom values and transported through the header. The expectation of an enhanced aggregator is that it will decode this JSON and, at its discretion, use it as the basis for the set of archives to query. Some nuances to this approach that have not been explored are (for example) whether the configuration can and should be applied to all users, the rules that should restrict which clients are authorized to effect this change in the aggregator’s operation, and how to further express the extent to which the preference was applied (beyond supplying the Preference-Applied response header).

7.2 Cycle detection

In Sect. 6.1, we introduced the potential for a cycle to occur when Memento aggregators are user-configurable and oblivious to the sources subsequently queried by aggregators further in the request chain. Approaches to mitigating cycles admittedly require the notion of HTTP being stateless to be violated. For instance, including a nonce or unique value in the request, propagating it to the sources queried (whether a web archive or aggregator), and likewise reading this value on receipt would allow the process to be short-circuited and provide a requestor some indication that the requestee was a requestor earlier in the hierarchical chain.
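The nonce-based approach can be sketched with in-memory objects standing in for web services. The class and method names are hypothetical, method calls stand in for HTTP requests, and the `nonce` argument stands in for a value that would be carried in a request header; how that value is transported is not specified here.

```python
import uuid


class Archive:
    """Stand-in for a web archive; it ignores any nonce it receives."""

    def __init__(self, mementos):
        self.mementos = mementos

    def timemap(self, uri_r, nonce=None):
        return list(self.mementos)


class Aggregator:
    """Stand-in for an aggregator that remembers the nonces it has handled."""

    def __init__(self, name):
        self.name = name
        self.sources = []
        self.seen = set()  # retained state, which departs from statelessness

    def timemap(self, uri_r, nonce=None):
        nonce = nonce or uuid.uuid4().hex  # the first requestor mints the nonce
        if nonce in self.seen:
            return []  # short-circuit: this request already passed through here
        self.seen.add(nonce)
        results = []
        for source in self.sources:
            results.extend(source.timemap(uri_r, nonce))
        return results


# Aggregators A and B are configured to query each other, forming a cycle.
a, b = Aggregator("A"), Aggregator("B")
a.sources = [Archive(["memento-from-archive-A"]), b]
b.sources = [Archive(["memento-from-archive-C"]), a]

results = a.timemap("http://example.com")
```

The set of seen nonces grows without bound in this sketch; an implementation would likely expire entries after some interval.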

Fig. 7

Rather than an aggregator waiting for the slowest archival source to respond, the response can be progressively built based on the data received thus far. This response may be served to a client as a preliminary response as indicated by HTTP 202

7.3 Preliminary results streaming

HTTP provides an often unused but standardized mechanism for a server to convey that a request is still processing (HTTP 202 status code) and that a client should wait and check back later [13], often after some indicated amount of time. In the context of Memento aggregation, the sources from which resources are requested, whether web archives or other archival sources (e.g., other aggregators per Sect. 4.3), likely return results in varying amounts of time. This can create a bottleneck while the aggregation service waits for the slowest endpoint to respond but can be optimized by progressively building the result (Fig. 7). MemGator, for instance, merges TimeMaps as they arrive from the queried sources and provides timeouts that can be specified by the user (i.e., the “user” that is executing the MemGator binary, not one making the HTTP request).

Clients making requests to aggregators have some expectation of a balance between correctness, completeness, and efficiency based on their particular use case. As above, the ability to short-circuit a request either after meeting some criteria or threshold (e.g., response time) is not well-explored. As with our previous work in allowing a client to specify the set of archives to use as the basis for aggregation [22], we provide a sample approach in specifying these conditions to an enhanced aggregator in Sect. 8.1.

An important precondition for optimizing aggregators’ processing through streaming is the recognition that Memento neither guarantees nor enforces internal temporal order of the identifiers in TimeMaps. When progressively merging TimeMaps from a partial set of the sources requested, the sorting can be performed asynchronously relative to responses being received or, more simply, not at all. For an aggregator to wait until all web archives have responded (which may never occur in the case of transient errors at an archive) is temporally inefficient. However, returning an incomplete (i.e., containing results only from a subset of archives), partially sorted, or unsorted aggregated TimeMap to an end-user while the aggregator continues to wait can help to inform the end-user of the degree of success thus far. This may be useful in cases where the results of the archives referenced in the aggregated TimeMap are explicit (e.g., through included metadata) instead of needing to be inferred (e.g., zero URI-Ms from an archive might mean no captures). This latter point can help end-users make an informed decision to prematurely close the request if the results from an archive, as expressed in the partially aggregated TimeMap, do not meet their expectations. This feature is discussed further in Sect. 8.1.

While the ability to return a TimeMap containing results from a subset of the archives from which TimeMaps were requested may be useful and more efficient, the temporal burden for an aggregator to sort results is relatively small, as sorting can be performed asynchronously and progressively. Despite this, partial, unsorted, concatenated TimeMaps returned either through streaming or through the HTTP 202 mechanism allow results, even if intermediate, to be used immediately rather than waiting on a likely unrevealed (to the end-user) set of conditions that must be met before the response is returned.
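A minimal sketch of progressively building an aggregated result and reporting whether it is still preliminary is given below. The class name, the use of `(timestamp, URI-M)` tuples, and the pairing of a status code with the entries are assumptions for illustration; no HTTP server is involved.

```python
import bisect


class ProgressiveTimeMap:
    """Accumulates entries per source and reports 202 until all have answered."""

    def __init__(self, expected_sources):
        self.pending = set(expected_sources)
        self.entries = []  # (14-digit timestamp, URI-M) tuples, kept sorted

    def add(self, source, entries):
        """Merge one source's entries into the sorted running result."""
        for entry in entries:
            bisect.insort(self.entries, entry)
        self.pending.discard(source)

    def snapshot(self):
        """Return (status, entries): 202 while sources are still pending."""
        status = 202 if self.pending else 200
        return status, list(self.entries)


timemap = ProgressiveTimeMap(["archive-a", "archive-b"])
timemap.add("archive-b", [("20200101000000", "https://archive-b.example/m2")])
preliminary = timemap.snapshot()

timemap.add("archive-a", [("20100101000000", "https://archive-a.example/m1")])
final = timemap.snapshot()
```

Inserting into a sorted list keeps each snapshot ordered at the cost of extra work per entry; an unsorted append followed by an optional final sort would match the "not at all" option discussed above.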

7.4 Rescoping the aggregator for client-side execution

In Sect. 2, we alluded to the propagation model, which may itself become recursive, of a client querying an aggregator that then similarly becomes the client through propagation of parameters. With Memento, a user-agent conventionally represents a client, transforming the request to the appropriate format (e.g., HTTP headers) as expected by a server (e.g., an aggregator).

From the client’s perspective, the set of archives that an aggregator queried is not typically revealed. For example, if a client sends a request to an aggregator for https://www.springer.com/journal/799 and receives back a TimeMap containing URI-Ms (Fig. 5), the archives represented by the URI-Ms might constitute the entire set queried, but that fact is not explicitly conveyed. It is likely and common, because of archival scoping and based on the URI-R provided, that some archives within the set queried possessed no mementos for the URI-R and thus are not represented. It is wasteful and temporally inefficient to send requests to archives that possess no captures for a URI-R [25]. A priori knowledge, as established by profiling archives' holdings [2] or more specifically by MementoMap [4], helps to mitigate this problem. These advancements allow the set of archives to be strategically defined so that URI-Rs unlikely to be in an archive’s holdings are not requested from it. However, MementoMap requires archival cooperation and is not foolproof if the index of the captures [9] is not updated to be representative of newly collected captures. It is also heuristic-based and so admits false positives: an archive that was queried because its profile indicates that it holds captures may still return no URI-Ms in its TimeMap.
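To illustrate how a priori knowledge can narrow the set of archives queried, the following sketch filters archives by host suffix. It is a deliberate simplification and does not reproduce the MementoMap format or lookup procedure; the `PROFILES` data and both function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical, hand-written profiles: archive name -> host suffixes it is believed to hold.
PROFILES = {
    "general-archive": {""},  # the empty suffix matches every host
    "uk-archive": {"uk"},
    "publisher-archive": {"springer.com"},
}


def host_matches(host, suffix):
    """True if host equals suffix or is a subdomain of it."""
    return suffix == "" or host == suffix or host.endswith("." + suffix)


def select_archives(uri_r, profiles):
    """Return the archives whose profile suggests they may hold uri_r."""
    host = (urlsplit(uri_r).hostname or "").lower()
    return sorted(
        name
        for name, suffixes in profiles.items()
        if any(host_matches(host, suffix) for suffix in suffixes)
    )
```

For the URI-R https://www.springer.com/journal/799, this selection keeps the general and publisher archives and skips the UK-scoped archive. As with any profile, a match indicates only that captures are likely, so a selected archive may still return an empty TimeMap.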

Fig. 8
A revised Mink [24] interface (prototype depicted) allows for client-side archival specification but also instills aspects of query precedence and customizing thresholds, among other advanced aggregation concepts introduced in prior work [21, 22, 25] and expanded upon in this paper

8 Implementation

In previous sections, we referenced additional features that a Memento aggregator is in a position to provide for usability beyond server-side, service-controlled, non-customizable aggregation. These features include:

Feature A: Client-side archival specification (Sect. 7 [22])

Feature B: Streaming partially aggregated TimeMaps as they arrive from archival sources (Sect. 7.3)

Feature C: Metadata of archives queried and aggregated (Sect. 7.3)

Feature D: Cycle mitigation to prevent aggregators from recursively and infinitely querying themselves (Sect. 6.1)

Feature E: Pure client-side memento aggregation (Sect. 3.3)

As part of evaluating the feasibility and potential complications of exhibiting these features, we modified two open source software tools as a proof-of-concept. The first is a modification of MemGator. Because we also recognize that running a command-line service is not feasible for some end-users, we have also provided a modified version of Mink. The latter provides the novel aspect of a purely client-side Memento aggregator (Feature E). Note that these changes to the tools are meant as a proof-of-concept: a reference approach to apply elsewhere and a basis on which the functionality can be improved.

8.1 MemGator changes

Prior to this work, the de facto approach to using an aggregator, as discussed in Sect. 3, was to query it through an HTTP request as an end-user. While a “user” who runs a MemGator instance can customize the set of archives aggregated upon initializing the service through the MemGator executable, customizing this set as a client of a running instance is still not possible with MemGator releases as of this writing. Our prior work [22] provided a fork of MemGatorFootnote 8 to exhibit the functionality of allowing clients (beyond those running an instance) to specify the set of archives that are queried. We expand on this implementation to provide additional functionality beyond the initial contribution.

Feature B entails serving preliminary results to clients using the HTTP 202 mechanism described in Sect. 7.3 and illustrated in Fig. 7.

The mechanism for providing metadata (Feature C) can be a powerful one that allows researchers to hone their data collection efforts. For example, if requests are sent with URI-Rs of a certain sort, a researcher would not need to infer whether an archive was queried by the aggregator for the results. Rather, the aggregator will be explicit about which archives were queried and whether each response contained no mementos or the transaction from aggregator to archival source went awry. More standardized mechanisms for expressing this metadata can likely be investigated to make the process more useful, usable, and interoperable.
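One possible shape for such per-archive metadata is sketched below. The field names, the outcome labels, and the mapping from HTTP status to outcome are assumptions made for illustration and are not the format used by our MemGator fork.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ArchiveReport:
    archive: str
    http_status: Optional[int]  # None when no response was received
    memento_count: int
    outcome: str


def report_for(archive: str, http_status: Optional[int], uri_ms: List[str]) -> ArchiveReport:
    """Classify one archive's response so the client need not infer what happened."""
    if http_status is None:
        outcome = "no_response"
    elif http_status == 200 and uri_ms:
        outcome = "mementos_returned"
    elif http_status == 404 or (http_status == 200 and not uri_ms):
        outcome = "no_mementos"
    else:
        outcome = "error"
    return ArchiveReport(archive, http_status, len(uri_ms), outcome)


def metadata_block(reports: List[ArchiveReport]) -> str:
    """Serialize the reports (hypothetical JSON layout) to accompany a TimeMap."""
    return json.dumps({"archives_queried": [asdict(r) for r in reports]}, indent=2)
```

With a report for every archive queried, an archive that held no captures is distinguishable from one whose request failed, which is the distinction a bare list of URI-Ms cannot express.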

Feature E is irrelevant to MemGator, which expects communication to an HTTP endpoint for a persistent process to be used by multiple clients.

8.2 Mink changes

In Sect. 3.3, we discussed the limitations on the capability of Mink in that (1) it requires an enhanced server-side aggregator (e.g., a MemGator instance as initially described in Sect. 8.1) or (2) it must perform aggregation itself to implement the request for archival selection and specification [22] by the user. In this subsection, we discuss the changes to the Mink codebase to exhibit both of these cases: leveraging an enhanced server-side aggregator or ultimately relying on a purely client-side approach for aggregation. Either approach, or both, becomes usable with these Mink modifications.

9 Discussion and future work

Implicit in this work is the continuous effort to enable the end-user, for whom aggregators are typically deployed, to be more specific about what they would like aggregated. As described in Sect. 3.3, allowing for this degree of interaction with a web service will likely have ramifications for efficiency. For example, caching mechanisms may not be beneficial if archival sources vary with each request. For the Time Travel service, this might be moot, as the set of archives queried is controlled server-side. For open-source aggregators, however, which have the potential for extended capability, this process can be further optimized and explored.

There is also the notion of functional cohesion, that is, a service should ideally do one job and do it well. This cohesion is already violated in practice with the addition of TimeGate functionality being co-located with TimeMap querying (i.e., aggregation) endpoints. We hope to see further work investigating use cases for end-users querying aggregators and for researchers deploying their own aggregators, as well as the functions and processes inherent to the aggregation procedure, to make the aggregation concept generally more usable.

The choice to use HTTP 202 (Accepted), per its specified semantics [13], requires an additional level of indirection, despite the simplified portrayal in Fig. 7. As defined in RFC 7231, a response with this status code should contain a representation that describes the status of the (aggregation) process and provide a means (e.g., another URI) to obtain more information. For simplicity, our proof-of-concept somewhat abuses this status code in that the representation returned by the enhanced, server-side aggregator is a TimeMap representing a partially processed result. This can be improved. On initial exploration, HTTP 206 [12] might seem suitable, but its purpose is to serve a partial response body for a range request, which does not accurately reflect what an aggregator needs to convey with a partial TimeMap.
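A response closer to the intended semantics of HTTP 202 might resemble the following sketch, in which the body describes the state of the aggregation and points to a status monitor instead of carrying the partial TimeMap itself. The URI paths, JSON field names, and the function `accepted_response` are hypothetical and are not part of our proof-of-concept.

```python
import json


def accepted_response(job_id, responded, pending):
    """Build a (status, headers, body) triple for an aggregation still in progress.

    The body describes the current status and points to a status monitor,
    with the partial TimeMap exposed at a separate (hypothetical) URI.
    """
    if not pending:
        raise ValueError("202 is only appropriate while archives are still pending")
    body = {
        "state": "in_progress",
        "archives_responded": sorted(responded),
        "archives_pending": sorted(pending),
        "status_monitor": f"/aggregation/{job_id}/status",  # hypothetical path
        "partial_timemap": f"/aggregation/{job_id}/timemap",  # hypothetical path
    }
    headers = {"Content-Type": "application/json"}
    return 202, headers, json.dumps(body)
```

Under this design, a client would poll the status monitor or dereference the partial TimeMap URI at its discretion, at the cost of the additional round trip that our simplified proof-of-concept avoids.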

To reiterate, one goal of this work was to evaluate the state-of-the-art of Memento aggregation and to provide an open-ended exploratory investigation and enumeration of the functional potential of Memento aggregators that is not currently being utilized.

10 Conclusion

This paper focused on Memento aggregation. We explicitly identified the state-of-the-art in pure server-side aggregators (Time Travel) and user-deployable aggregators (MemGator). When an aggregator is user-configurable and user-deployable, which has proven useful to researchers, other potential issues may arise (Sect. 6) based solely on the current functionality of an aggregator. We proposed further functional extensions to the internal aggregation process and provided reference implementations. To mitigate the requirement for an aggregator to be “enhanced” with this extended functionality, we have also implemented a purely client-side Memento aggregator and bundled the reference implementation within an existing, publicly deployed browser extension. All implementations contributed in this paper are open source with permissive licenses, with the intent of enabling further exploration of Memento aggregation beyond simple end-user utilization.

From the perspective of a web service where a client sends an HTTP request to an endpoint, the aspects of this work may not matter much. However, aggregators in the status quo still hold untapped capability beyond that of the typical use case (S1). By enumerating the potential concerns that may arise (Sect. 6) with a user-controlled Memento aggregator, we hope to advance the ultimate goal of enabling a client to have more expression and preference in the process of aggregating web archives.