1 Introduction

The web has become ingrained in our lives, influencing our daily activities. Preserving the web through web archives is more important than ever. With over 686 billion web pages archived [30] dating back to 1996, the Internet Archive (IA) is the largest and oldest of the web archives. The Wayback Machine, which can replay past versions of websites, is a public service provided by IA. Arquivo.pt [15, 23] has been archiving millions of files from the Internet since 1996. Both web archives contain information in a variety of languages and provide public search capabilities for historical content.

Previous research has predominantly concentrated on examining user behaviors within several domains of the live web, such as e-commerce platforms, search engine interactions, and general website users and usage [20, 52, 61, 65, 67]. A more recent line of work explores user behaviors for security and intrusion detection [27, 29, 53, 57, 66]. This kind of analysis serves multiple purposes. It assists in understanding user preferences toward products or events. Additionally, it aids in recognizing potentially suspicious behavior across various online platforms concerning security and privacy by evaluating their specific traits [18]. However, despite these focused studies, there remains a substantial gap in comprehensively exploring user interactions specifically within web archives. Understanding accesses to web archives is important because it provides insights for making the best use of limited web archive resources. It also supports the efficient maintenance and organization of web archive data for future use.

Our study is an extension of a previous study by AlNoamany et al. [7] that examined access patterns for robots and humans in web archives based on a web server log sample from 2012 from the Wayback Machine. By using several heuristics including browsing speed, image-to-HTML ratio, requests for robots.txt, and User-Agent strings to differentiate between robot and human sessions, AlNoamany et al. determined that in the IA access logs in 2012, humans were outnumbered by robots 10:1 in terms of sessions, 5:4 in terms of raw HTTP accesses, and 4:1 in terms of megabytes transferred. The four web archive user access patterns defined in the previous study are Dip (single access), Slide (the same page at different archive times), Dive (different pages at roughly the same archive time), and Skim (lists of what pages are archived, i.e., TimeMaps).

In our initial study [34], we revisited the work of AlNoamany et al. by examining user accesses to web archives using three different datasets from anonymized server access logs: 2012 Wayback Machine (IA2012), 2019 Wayback Machine (IA2019), and 2019 Arquivo.pt (PT2019). In this study, we additionally examine a new dataset of 2015 Wayback Machine anonymized server access logs (IA2015). Using these four datasets, we identify human and robot access, identify important web archive access patterns, and discover the temporal preference for web archive access. We extend AlNoamany et al.’s criteria for distinguishing robots from humans with a few adjustments. These heuristics are discussed in detail in Sect. 3.4.

The following are the primary contributions of our study:

  1. We used a full day’s worth of access logs from each of four web archive datasets (IA2012, IA2015, IA2019, PT2019) to distinguish between human and robot access. The share of requests attributed to robots in IA2012 (91% of requests) and IA2015 (88% of requests) is greater than in IA2019 (70% of requests). Robots account for 98% of requests in PT2019.

  2. We looked at different access patterns exhibited by web archive users (humans and robots). We found that robots are almost entirely limited to Dip and Skim in IA2012 and IA2015, but exhibit all the established patterns and their combinations in IA2019.

  3. We explored human and robot users’ temporal preferences for web archive content. The majority of requests were for mementos [63] that were near the datetime of each access log dataset, suggesting a preference for archived content from the recent past.

In this paper, we are attempting to understand who accesses the web archives. To be clear, we are not making any value judgments about robots, because we recognize that not all bots are bad. For example, there are beneficial services like Internet Archive Scholar [50], ArchiveReady [9], TMVis [41], and MemGator [3] that are built on top of web archives. But the needs of interactive users are different from those of robots, and we can better design and implement API access for robots (e.g., [19, 42, 48, 55]) if we better understand how robots are using the interfaces designed for interactive users.

2 Background and related work

Web clients and servers communicate using the Hypertext Transfer Protocol (HTTP) [21]. Web clients (such as a web browser or web crawler) make HTTP requests to web servers using a set of defined methods, such as GET, HEAD, and POST, to interact with resources [2]. For instance, the GET method is used to request a resource, the POST method is used to submit data to a resource (for example, to create or update it), and the HEAD method is similar to GET, but it requests only the metadata (headers) without fetching the actual content (payload). Web servers respond using a set of defined HTTP status codes, headers, and a payload (if any). The HTTP status codes convey the outcome of the request (200 OK, 404 Not Found, etc.), headers provide metadata about the response (content type, server details, etc.), and the payload contains the actual data being sent back to the client (HTML, JSON, images, etc.).

Web server logs are records containing information about requests, responses, and errors processed by a web server. Extracting useful data from web server logs and analyzing user navigation activity is referred to as web usage mining [47, 59, 64]. Numerous studies have analyzed different web usage mining techniques and identified user access patterns on the Internet [40, 44]. Web usage mining is used to increase the personalization of web-based applications [46, 51]. Mobasher et al. [45] developed an automatic personalization technique using multiple web usage mining approaches. Web usage mining has also been applied in user profiling [14, 25], web marketing initiatives [10], and enhancing learning management systems [68].

The goal of web archives is to capture and preserve original web resources (URI-Rs). Each capture, or memento (URI-M), is a version of a URI-R that comes from a fixed moment in time (Memento-Datetime). The list of mementos for a particular URI-R is called a TimeMap (URI-T). All of these notions are outlined in the Memento Protocol [49, 63].
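To make these notions concrete, the following is a minimal sketch that splits a replay path into its Memento-Datetime and URI-R. The path layout (`/web/<14-digit datetime>/<URI-R>`, with an optional two-letter modifier such as `im_`) is an assumption modelled on Wayback-style replay URLs; other archives may use a different prefix, and the function name `parse_uri_m` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed Wayback-style URI-M layout: /web/<14-digit datetime>[modifier]/<URI-R>
URI_M = re.compile(r"^/web/(\d{14})(?:[a-z]{2}_)?/(.+)$")

def parse_uri_m(path):
    """Return (Memento-Datetime, URI-R) for a URI-M path, or None otherwise."""
    match = URI_M.match(path)
    if match is None:
        return None  # e.g. a TimeMap request or a non-archival path
    stamp, uri_r = match.groups()
    memento_dt = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return memento_dt, uri_r
```

A path that does not carry a full 14-digit datetime is simply reported as `None` here; a real parser would also need to handle TimeMap paths and partial datetimes.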

In this work, we look at web archive server access logs and perform web usage mining in the context of web archives. There has been past work on how users utilize and behave in web archives [6, 16, 17, 22, 24, 28], including the 2013 study [7] that we revisit. Web archives maintain their web server access logs as plain text files that record each request to the web archive. Most HTTP servers use the standard Common Log Format or the extended Combined Log Format to record their server access logs [8]. An example access log entry from the Arquivo.pt web archive is shown in Fig. 1. A single log entry consists of (left to right) the IP address of the client, user identity, authenticated user’s ID, date and time, HTTP method, request path, HTTP version, HTTP status code, size of the response in bytes, referrer, and User-Agent. The request path in this log entry shows that this is a request for a URI-M. The client IP address is anonymized in the access log datasets for privacy reasons. Alam has implemented an HTTP access log parser [1], with features specific to web archive access logs, which can be used to process such logs.

Fig. 1 A sample access log entry from the PT2019 dataset (Fields: IP address of the client, user identity, authenticated user’s ID, date and time, HTTP method, request path, HTTP version, HTTP status code, size of the response in bytes, referrer, and User-Agent)
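The field layout listed above can be extracted with a single regular expression. The sketch below is an illustration only and is not the parser referenced in [1]; the function name `parse_line` and the group names are assumptions, and real logs would need more tolerant handling of malformed request lines.

```python
import re

# One named group per Combined Log Format field, in left-to-right order.
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None
```

All values are returned as strings; converting `status` and `size` to integers and `time` to a datetime is left to the caller.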

AlNoamany et al.’s previous work [7] in 2013 set the groundwork for this study. In addition to their analysis of the prevalence of robot and human users in the Internet Archive, they also proposed a set of basic user access patterns for users of web archives:

  • Dip—The user accesses only one URI (URI-M or URI-T).

  • Slide—The user accesses the same URI-R at different Memento-Datetimes.

  • Dive—The user accesses different URI-Rs at nearly the same Memento-Datetime (i.e., dives deeply into a memento by browsing links of URI-Ms).

  • Skim—The user accesses different TimeMaps (URI-T).
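The four patterns above can be sketched as a small classifier over a single session. This is a simplified reading of the definitions, not the implementation used in the study: the tuple layout, the function name `classify_session`, and in particular the one-day `tolerance` used to decide what counts as "nearly the same" Memento-Datetime are all assumptions.

```python
from datetime import datetime, timedelta
from itertools import combinations

def classify_session(requests, tolerance=timedelta(days=1)):
    """requests: list of (kind, uri_r, memento_dt); kind is "M" (URI-M) or "T" (URI-T)."""
    if len(requests) == 1:
        return {"Dip"}  # a single access of any kind
    mementos = [(uri_r, dt) for kind, uri_r, dt in requests if kind == "M"]
    timemaps = {uri_r for kind, uri_r, _ in requests if kind == "T"}
    patterns = set()
    for (r1, d1), (r2, d2) in combinations(mementos, 2):
        if r1 == r2 and d1 != d2:
            patterns.add("Slide")  # same URI-R, different Memento-Datetimes
        elif r1 != r2 and abs(d1 - d2) <= tolerance:
            patterns.add("Dive")   # different URI-Rs, nearly the same datetime
    if len(timemaps) > 1:
        patterns.add("Skim")       # several different TimeMaps
    return patterns or {"Unknown"}
```

Returning a set allows a session to carry a hybrid label such as Dive and Slide; the pairwise comparison is quadratic in session length and is meant only to illustrate the definitions.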

In a separate study, AlNoamany et al. looked into the Wayback Machine’s access logs to understand who created links to URI-Ms and why [5, 6]. They found that web archives were more often used to visit pages no longer on the live web (as opposed to prior versions of pages still on the web), and much of the traffic came from sites like Wikipedia.

Alam et al. [4] describe archival voids, or portions of URI spaces that are not present in a web archive. They created multiple archival void profiles using Arquivo.pt access logs, and while doing so, identified and reported access patterns, status code distributions, and issues such as Soft-404 (when a web server responds with an HTTP 200 OK status code for pages that are actually error pages [43]). While their research is very similar to ours, the mentioned access patterns differ from ours. Their study looks at which users are accessing the archive and what they request, whereas we explain how a user (robot or human) might traverse through an archive.

3 Methodology

In this work, we leverage cleaned access logs after pre-processing raw access logs to identify user sessions, detect robots, assess distinct access patterns used by web archive visitors, and finally check for any temporal preferences in user accesses. The steps of our analysis are shown in Fig. 2. The code [31] and visualizations [32] are published, and each step is explained in detail in this section.

Fig. 2 A chart illustrating the phases in our analytical procedure

3.1 Dataset

In this study, we are using four full-day access log datasets from two different web archives: February 2, 2012 access logs from the Internet Archive (IA2012); February 5, 2015 access logs from the Internet Archive (IA2015); and February 7, 2019 access logs from Internet Archive (IA2019) and Arquivo.pt (PT2019). We chose the first Thursday of February for our datasets to align with the prior analysis performed on a much smaller sample (2 million requests representing about 30 min) from the Wayback access logs from February 2, 2012 [7].

The characteristics of the raw datasets are listed in Table 1. We show the frequency of HTTP request methods and HTTP response codes, among other features. HTTP GET is the most prevalent request method (>98%) in all four datasets, while the HTTP HEAD method accounts for less than 2% of requests.

Due to the practice of web archives redirecting from the requested Memento-Datetime to the nearest available memento, all four of our samples have numerous 3xx responses. About 53% of requests in IA2012, 40% in IA2015, and 43% in IA2019 resulted in 3xx responses. About 20% of requests are 3xx in PT2019, due to the same behavior. IA2015 and IA2019 have a higher percentage of requests for embedded resources (about 63%), followed by IA2012 (44%), whereas PT2019 has only 20%. IA2015 has the highest percentage of requests with a null referrer field (78%), whereas IA2012 has around 48%. The percentage of requests with a null referrer field dropped nearly fourfold between IA2015 (78%) and IA2019 (20%). There is an increase in the percentage of self-identified robots (SI robots) from IA2012 (0.01%) and IA2015 (0.04%) to IA2019 (0.15%). The percentage of SI robots in PT2019 is twice that in IA2019. We used some of these features (HEAD requests, embedded resources, and SI robots) in the bot identification process (covered in Sect. 3.4).

Table 1 Features for each dataset: February 2, 2012 from IA (IA2012); February 5, 2015 from IA (IA2015); February 7, 2019 from IA (IA2019); and February 7, 2019 from Arquivo.pt (PT2019)

3.2 Data cleaning

An overview of our data cleaning process is shown in Fig. 2. In the Stage 1 data cleaning (S1), we removed the log entries that were either invalid or irrelevant to the analysis. We only kept legitimate requests to web archive content (mementos and TimeMaps) and requests to the web archive’s robots.txt. The robots.txt requests were preserved since they will be utilized as a bot detection heuristic later on in our process.

After S1 data cleaning, we identified user sessions in each of our four datasets (Sect. 3.3) and conducted bot identification (Sect. 3.4). Stage 2 data cleaning (S2) takes place only after the requests have been flagged as human or robot. Our study’s ultimate goal was to detect user access patterns of robots and humans in our datasets, and to do so, we had to ensure that the refined datasets only included requests that a user would make. As a result, in S2, we purged log entries that were unrelated to user behavior. This includes the browser’s automatic requests for embedded resources, any requests using a method other than HTTP GET, and requests generating responses with status codes other than 200, 404, and 503. Several of these requests, including embedded resources and HEAD requests, were necessary during the bot detection phase. Thus, we had to follow a two-step data cleaning approach.
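The S2 filter described above can be sketched as a single predicate over parsed log entries. The entry layout (a dict with `method`, `status`, and `uri_r` keys), the function names, and the use of file extensions to recognize embedded resources are assumptions made for illustration; the study may identify embedded resources differently.

```python
import os
from urllib.parse import urlparse

# Hypothetical extension list used to recognize embedded resources.
EMBEDDED_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico", ".svg", ".woff"}
KEPT_STATUS = {200, 404, 503}

def is_embedded(uri_r):
    """Guess from the URL path extension whether this is an embedded resource."""
    ext = os.path.splitext(urlparse(uri_r).path)[1].lower()
    return ext in EMBEDDED_EXTENSIONS

def keep_in_s2(entry):
    """Keep only GET requests for pages that returned 200, 404, or 503."""
    return (entry["method"] == "GET"
            and entry["status"] in KEPT_STATUS
            and not is_embedded(entry["uri_r"]))
```

Because the same fields feed the bot heuristics, this predicate would be applied only after every session has been labelled.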

Table 2 shows the number of requests for each dataset after each cleaning stage. The percentages are based on the raw dataset’s initial number of requests. PT2019 had a higher percentage of requests remaining after S2 compared to IA2012, IA2015, and IA2019. This could be related to the low percentage of embedded resources (20%) in the raw PT2019 dataset (Table 1).

Table 2 Number of requests in each of the four datasets (IA2012, IA2015, IA2019, and PT2019): Initial raw data, after stage 1 cleaning, and after stage 2 cleaning

3.3 Session identification

After S1 data cleaning, the next phase in our study was session identification (Fig. 2). A session can be defined as a set of interactions by a particular user with the web server within a given time frame. We split the requests into different user sessions after S1 data cleaning. First, we sorted all of the requests by IP and User-Agent, then identified the user sessions based on a 10-minute timeout threshold similar to the prior study’s process [7]. That is, if the interval between two consecutive requests with the same IP and User-Agent is longer than 10 min, the second request is considered as the start of the next session for that user.
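The session-splitting rule can be sketched as follows. The tuple layout `(ip, user_agent, timestamp)` and the function name `sessionize` are assumptions for illustration; only the grouping key and the 10-minute timeout come from the description above.

```python
from datetime import datetime, timedelta
from itertools import groupby

TIMEOUT = timedelta(minutes=10)

def sessionize(requests):
    """requests: iterable of (ip, user_agent, timestamp). Returns a list of sessions."""
    ordered = sorted(requests)  # by IP, then User-Agent, then timestamp
    sessions = []
    for _, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        current = []
        for req in group:
            # A gap longer than the timeout starts a new session for this user.
            if current and req[2] - current[-1][2] > TIMEOUT:
                sessions.append(current)
                current = []
            current.append(req)
        sessions.append(current)
    return sessions
```

A gap of exactly 10 minutes keeps both requests in the same session, matching the "longer than 10 min" wording.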

3.4 Bot identification

As the next step in our process, we employed a heuristic-based strategy to identify robot requests (Fig. 2). We used the original five heuristics used in prior work [7] (User-Agent check, number of User-Agents per IP, robots.txt file, browsing speed, and Image-to-HTML ratio) with some minor adjustments to improve the performance of the robot detection. Additionally, we have introduced a new heuristic named “the Type of HTTP request method” to identify robot accesses. The following sub-sections will go through each heuristic in detail. The real-world examples for each heuristic taken from the web archive access logs are shown in the appendices.

3.4.1 Known bots

We created a list of User-Agents that are known to be used by bots. We first constructed \(UA_l\), a list of all User-Agent strings from our four datasets. From this list, we compiled \(UA_m\) by filtering for User-Agent strings that contained robot keywords, such as “bot,” “crawler,” and “spider.” We compiled a separate bot User-Agent list \(UA_d\) by running our full list \(UA_l\) through DeviceDetector [12], a parser that filters on known bot User-Agent strings. Our final list [33] of bot User-Agents \(UA_{K_b}\) was constructed by combining \(UA_d\) with our keyword set \(UA_m\). Any request with a User-Agent found in \(UA_{K_b}\) was classified as a robot. This heuristic is an adapted version of the User-Agent check heuristic from previous work. AlNoamany et al. classified a request as human if its User-Agent matched that of a known browser. To ensure the recognition of known bots present within our datasets, we developed \(UA_{K_b}\) specifically to capture the most current User-Agents. Appendix A provides a real-world example where the “bot” keyword appears in the User-Agent itself.
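The construction of \(UA_{K_b}\) can be sketched as a keyword filter combined with an externally supplied list. The sketch does not call DeviceDetector; the `detector_hits` argument is a stand-in for whatever that tool reports, and the function name `build_known_bots` and the case-insensitive matching are assumptions.

```python
import re

# Keywords named in the text; case-insensitive matching is an assumption.
BOT_KEYWORDS = re.compile(r"bot|crawler|spider", re.IGNORECASE)

def build_known_bots(all_user_agents, detector_hits):
    """Combine keyword matches (UA_m) with detector output (UA_d) into UA_Kb."""
    ua_m = {ua for ua in all_user_agents if BOT_KEYWORDS.search(ua)}
    return ua_m | set(detector_hits)

def is_known_bot(user_agent, known_bots):
    """A request is a robot if its User-Agent is in the known-bot list."""
    return user_agent in known_bots
```

Plain substring matching can over-match (any User-Agent that happens to contain "bot" is flagged), which is one reason a curated detector list is combined with it.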

3.4.2 Type of HTTP request method

Web browsers, which are assumed to be operated by humans, send GET requests for web pages. Therefore, we used HEAD requests as an indicator of robot behavior and integrated this approach as a new heuristic in our work. If the request made is a HEAD request, it is considered a robot request, and the session to which it belongs is counted as a robot session. Appendix B provides a real-world example where HEAD requests are made.

3.4.3 Number of User-Agents per IP (UA/IP)

There are robots that repeatedly change their User-Agent (UA) between requests to avoid being detected. The previous study [7] found that a threshold of 20 UAs per IP was effective in distinguishing robots from humans. This allows for some human requests behind a proxy or NAT that may have the same IP address but different User-Agents, representing different users sharing a single IP. As discussed in Sect. 3.3, we sorted the access logs from the four datasets based on IP first and then User-Agent. We marked any requests from IPs that update their User-Agent field more than 20 times as robots. Appendix C provides a real-world example where the User-Agent is changed for each request.
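A minimal sketch of the UA/IP heuristic follows. It counts distinct User-Agent strings per IP, which is one possible reading of "updates its User-Agent more than 20 times"; that interpretation, the `(ip, user_agent)` tuple layout, and the function name are assumptions.

```python
from collections import defaultdict

UA_PER_IP_THRESHOLD = 20  # threshold reported in the prior study

def robot_ips_by_ua_count(requests, threshold=UA_PER_IP_THRESHOLD):
    """requests: iterable of (ip, user_agent). Returns IPs exceeding the threshold."""
    agents = defaultdict(set)
    for ip, user_agent in requests:
        agents[ip].add(user_agent)
    return {ip for ip, uas in agents.items() if len(uas) > threshold}
```

An IP with exactly 20 distinct User-Agents is not flagged, leaving room for several human users behind one proxy or NAT address.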

3.4.4 Requests to robots.txt file

A robots.txt [37, 69] file contains information on how to crawl pages on a website. It helps web crawlers control their actions so that they do not overburden the web server or crawl web pages that are not intended for public viewing. As a result, a request for the robots.txt file can be considered an indication of a robot request. We identified any user who made a request for robots.txt as a robot. Appendix D provides a real-world example where requests are made to the robots.txt file.

3.4.5 Browsing speed (BS)

We used browsing speed as a criterion to distinguish robots from humans. Robots can navigate the web far faster than humans. Castellano et al. [13] found that a human would only make a maximum of one request for a new web page every two seconds. Similar to the previous study [7], we classified any session with a browsing speed faster than one HTML request every two seconds (or, \(BS \ge 0.5\) requests per second) as a robot. We experimented with an alternate approach involving browsing speed using a three-way criterion (visit duration exceeding 60 s, surpassing a threshold of 10 pages, and a browsing speed threshold of 0.25 pages/s), which was proposed by Tanasa et al. [62] in 2004. However, this approach resulted in a significantly lower detection rate of bots. Therefore, we opted to maintain the threshold set by Castellano et al., which had also been used in previous work by AlNoamany et al. Appendix E provides a real-world example where we can see several requests within a couple of seconds, which is unusual for human behavior.
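The browsing speed check can be sketched as below. Defining BS as the number of HTML requests divided by the session duration, and treating single-request sessions as not flagged, are assumptions; the source only fixes the 0.5 requests-per-second threshold.

```python
def browsing_speed(html_timestamps):
    """HTML requests per second over a session; timestamps are in seconds."""
    if len(html_timestamps) < 2:
        return 0.0  # assumption: a single request carries no speed signal
    duration = max(html_timestamps) - min(html_timestamps)
    if duration == 0:
        return float("inf")  # several requests within the same second
    return len(html_timestamps) / duration

def is_robot_by_speed(html_timestamps, threshold=0.5):
    """Flag a session whose browsing speed reaches the threshold."""
    return browsing_speed(html_timestamps) >= threshold
```

Other definitions (for example, dividing by the number of intervals rather than the number of requests) would shift borderline sessions, so the threshold should be read together with the chosen definition.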

3.4.6 Image-to-HTML ratio (IH)

Robots tend to retrieve only HTML pages, so requests for images can be regarded as a sign of a human user. A ratio of 1:10 images to HTML was proposed by Stassopoulou and Dikaiakos [60] and used in the prior study [7] as a threshold for distinguishing robots from humans. We flagged a session requesting fewer than one image file for every 10 HTML files as a robot session. IH was found to have the largest effect in detecting robots in the prior study’s dataset, and this holds true for our four datasets as well. Appendix F provides a real-world example where a session is marked as a robot using the IH ratio.
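The IH check reduces to a single comparison per session. In this sketch the counts are assumed to be computed beforehand, the function name is hypothetical, and integer arithmetic is used to avoid floating-point edge cases at exactly 1:10.

```python
def is_robot_by_ih(image_requests, html_requests):
    """Flag a session with fewer than one image request per 10 HTML requests."""
    # image/html < 1/10 rewritten as image * 10 < html to stay in integers.
    return image_requests * 10 < html_requests
```

A session with no HTML requests at all is never flagged by this check, since the comparison cannot hold when `html_requests` is zero.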

We used the aforementioned heuristics on our four datasets to classify each request as human or robot. If a request or session was marked as a robot by at least one of the heuristics, we classified it as a robot. After bot identification but before reporting the final results, we performed S2 as described in Sect. 3.2.
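The combination rule is a logical OR over the individual heuristics. The sketch below assumes each heuristic has already produced a boolean per session; the dictionary keys and function names are hypothetical labels, not identifiers from the study's code.

```python
def fired_heuristics(flags):
    """Return the sorted names of the heuristics that marked this session."""
    return sorted(name for name, hit in flags.items() if hit)

def classify(flags):
    """A session is a robot if at least one heuristic fired, otherwise human."""
    return "robot" if any(flags.values()) else "human"
```

Keeping the per-heuristic flags alongside the final label makes it possible to report each heuristic's contribution separately, even though several may fire for the same session.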

4 Results and analysis

To investigate the data further after S2 data cleaning, we divided each dataset into two subsets: human sessions and bot sessions. For each dataset, we used these two subsets to determine human access patterns and compare them to robot access patterns. Finally, we conducted a temporal analysis of the requests in both subsets for each dataset.

4.1 Robots versus humans

Table 3 reports the number of detected robots for each dataset based on the total number of sessions and the total number of requests. We counted the number of requests classified as robots based on each heuristic independently (as mentioned earlier, the heuristics are not mutually exclusive, so these numbers across a column do not need to add to exactly 100%). The final row in the table represents the total number of sessions and requests that are marked as robots after applying all the heuristics together.

Table 3 Bot identification results based on the total number of sessions and the total number of requests for each dataset: IA2012, IA2015, IA2019, and PT2019 (the header for each column displays the total number of sessions and requests). The heuristics are not mutually exclusive

The image-to-HTML ratio (IH) had the largest effect on detecting robots across all four datasets. The impact of IH was \(\approx {85-90\%}\) in IA2012 and \(\approx {75-80\%}\) in IA2015, but only \(\approx {55-65\%}\) in IA2019. In PT2019, \(\approx {80-96\%}\) of robots were detected using the IH ratio, which is higher than in IA2019; that is, we were able to detect almost all the robots in PT2019 through this one heuristic. We found that \(\approx {90\%}\) of requests were from robots in IA2012, \(\approx {88\%}\) in IA2015, \(\approx {70\%}\) in IA2019, and \(\approx {98\%}\) in PT2019.

The increase in human sessions in 2019 relative to 2012 and 2015 could be due to growing awareness of web archives among human users over the years. In addition, headless browsers and browser automation tools, such as Headless Chromium [11], PhantomJS [26], and Selenium [56], that provide automated web page control have also become popular in recent years. Their functionality simulates more human-like behavior that may not be caught easily by bot detection techniques. For instance, applications like the work of Ayala [54] and tools like oldweb.today [38, 39], DSA Toolkit [35, 36], TMVis [41], and the Memento-Damage service [58] that replicate human behavior make things challenging for detection algorithms. Comparing IA2019 and PT2019, PT2019 has \(\approx {30\%}\) more robot requests. Based on our PT2019 dataset, only 2% of all requests coming into Arquivo.pt are potential human requests.

4.2 Discovering access patterns

After distinguishing robots from humans, we divided all four of our datasets into human and bot subdatasets (IA2012_H, IA2012_B, IA2015_H, IA2015_B, IA2019_H, IA2019_B, PT2019_H, PT2019_B). We used these subdatasets to identify the access patterns followed by human and robot sessions. As introduced in Sect. 2, there were four different user access patterns established by AlNoamany et al. [7]. We looked into each of these patterns and identified their prevalence in our four datasets. We measured the prevalence of sessions that followed each of the four patterns (Dip, Dive, Slide, Skim), as well as sessions that followed a hybrid of those patterns (“Dive and Slide,” “Dive and Skim,” “Skim and Slide,” and “Dive, Slide, and Skim”). We categorized requests that do not fall into any pattern as Unknown.

Fig. 3 Access patterns of robots and humans in our subdatasets (IA2012_H, IA2012_B, IA2015_H, IA2015_B, IA2019_H, IA2019_B, PT2019_H, PT2019_B). The color of the stacked bar distinguishes between requests for mementos (URI-Ms) and TimeMaps (URI-Ts). Each chart is sorted in descending order by x-axis value (request percentage). Note that the x-axes in the charts are not the same

Figure 3 shows a chart for each subdataset. The horizontal (x) axis represents the percentage of requests and the vertical (y) axis represents the different patterns or hybrids of patterns. The percentages are based on the total number of requests for each subdataset. According to AlNoamany et al.’s findings based on the IA2012 dataset, Dips were the most common pattern in both human and robot sessions. However, in our full-day IA2012 and IA2015 datasets, Dive and Dip account for about the same percentage of human sessions and Skim is the most common pattern among robot sessions. Dip is the most common pattern in IA2019, followed by Dive, Slide for both human and robot sessions. The human Dips have doubled from IA2012 (24%) and IA2015 (26%) to IA2019 (51%), indicating that more humans accessed web archives for a single URI-M or URI-T in 2019 than in the previous years. There is a high percentage of robot Skims in IA2012 and IA2015 compared to IA2019: over 90% of IA2012 robot sessions and around 80% of IA2015 robot sessions are Skims. Long-running robot sessions that request URI-Ts account for most of the Skim percentage. In contrast to IA2019, PT2019 humans exhibit a higher percentage of Dive and Slide (45%) than Dips (29%). Even in robot sessions, the percentages of Dive (70%) and Dive and Slide (24%) are higher than that of Dip (6%).

Table 4 Proportion of robot and human accesses in each of the four datasets (IA2012, IA2015, IA2019, and PT2019) for TimeMaps and Mementos

The percentage of accesses by humans and robots to TimeMaps and mementos over the four datasets (IA2012, IA2015, IA2019, and PT2019) is shown in Table 4. In IA2012, robots almost always access TimeMaps (95%) and humans access mementos (82%). This trend continued in IA2015, with robots accessing TimeMaps 83% of the time, while humans accessed mementos 88% of the time. However, in IA2019, both humans and robots almost always access mementos (96%), whereas only 4% of those accesses are to TimeMaps. When looking at the hybrid patterns, PT2019 bot sessions combine at most two patterns, while the other subdatasets have a small percentage of sessions combining all three (Dive, Skim, and Slide). For each IA dataset, there is a very small percentage of requests (4.22% in IA2012, 3.75% in IA2015, and 0.97% in IA2019) that do not belong to any of the patterns. In the PT2019 dataset, every request fell into one of the patterns. The percentage of human requests falling under the Unknown category in IA2012 (4.02%) and IA2015 (3.42%) is higher compared to the IA2012 robot requests (0.2%), IA2015 robot requests (0.33%), IA2019 human requests (0.85%), and IA2019 robot requests (0.12%).

4.3 Identifying temporal preferences

We also explored the requested Memento-Datetime in our subdatasets to see if there was any temporal preference by web archive users. Figure 4 illustrates the temporal preference of robots and humans in our datasets. The x-axis represents the number of years prior, meaning the number of years passed relative to the datetime of the access logs (e.g., for IA2012, 2 years prior is 2010) and the y-axis represents the number of requests. Note that the y-axis in each chart is different.
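The "years prior" axis can be computed as a calendar-year difference, following the example above in which 2 years prior to IA2012 is 2010. The sketch below is an illustration under that assumption; the function name is hypothetical and the input is assumed to be the requested Memento-Datetimes of one subdataset.

```python
from collections import Counter
from datetime import datetime

def years_prior_histogram(memento_datetimes, log_date):
    """Count requests by how many calendar years the memento predates the log."""
    return Counter(log_date.year - dt.year for dt in memento_datetimes)
```

Each key of the returned counter is a bin on the x-axis and each value is the corresponding request count on the y-axis.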

It is evident that the majority of the requests are for mementos that are close to the datetime of each access log sample, and that requests gradually diminish as we go further back in time. There is no significant difference in temporal preference among IA2012, IA2015, and IA2019. IA2019 humans, IA2019 bots, and PT2019 bots exhibit the same trend; however, it is difficult to see a trend in PT2019 humans due to the small number of humans in the dataset. For PT2019 humans, there is a spike around 4–5 years prior, which implies PT2019 human accesses were mostly for mementos around 2015–2016. Knowing the temporal preferences of web archive users has a practical advantage: web archives can prioritize data for the most recent years, for example by keeping it in memory, to speed up access.

Fig. 4 Temporal preference of bots and humans in IA2012, IA2019, and PT2019 datasets

5 Future work

AlNoamany et al. [7] observed four different user access patterns in 2013. In our datasets combined, 0.48% of requests were outside of any of these patterns or their combinations. Future work could examine whether the requests that fell into the Unknown category follow any other generally applicable patterns, or whether they are completely random. The overall percentage of robots identified in IA2019 is much lower than in IA2012 and IA2015. We would like to repeat this study on more distinct full-day datasets to see if the reduction in robots is a general trend from 2012 and 2015 to 2019 or specific to the day we chose. Additionally, the IH [60] and BS [13] thresholds in our bot identification heuristics are based on the behavior of conventional web servers; it remains to be determined whether the same thresholds apply to web archival replay systems, as their dynamics differ (e.g., the Wayback Machine is typically slower than a typical web server).

6 Conclusions

We used full-day access log samples from the Internet Archive’s (IA) Wayback Machine from 2012, 2015, and 2019, as well as from Arquivo.pt from 2019, to distinguish between robot and human users in web archives. The percentage of robot requests detected in the IA2012 (90.94%) and IA2015 (87.99%) datasets is higher than in IA2019 (69.91%). We discovered that robot accesses account for 98% of requests (97% of sessions) based on 2019 server logs from Arquivo.pt. We also discovered that in IA2012 and IA2015, robots almost exclusively followed the Skim pattern, but that in IA2019, they exhibit all of the patterns and their combinations. Regardless of whether the user is a robot or a human, the majority of requests were for mementos that are close to the datetime of each access log dataset, demonstrating a preference for the recent past. In summary, these insights into users’ behaviors and temporal preferences can be leveraged to improve the efficiency of web archives by tailoring resource allocation accordingly. We believe that this will further strengthen web archives, enhancing accessibility and preserving invaluable historical web content for diverse purposes.