1 Introduction

The dark web is an anonymous portion of the internet that evades traditional search engine indexing. As part of the deep web, it encompasses web pages not reachable via search engines and requires specific proxies and a direct URL for access [1]. This subset of the internet houses websites and online platforms that use advanced software to maintain the anonymity of both operators and visitors. Of these software tools, Tor (The Onion Router) is the most notable [2].

The Tor network is a decentralized, anonymous communication framework allowing users to browse the Internet and access resources while maintaining their privacy [3]. It uses a layered encryption system to ensure user anonymity. The Internet traffic of users in Tor is directed through a series of volunteer-run relays, or nodes, running on top of the traditional networked infrastructure. Each node only recognizes the IP address of the preceding and succeeding nodes, not the data origin or ultimate destination. The data packets traversing the Tor network undergo multiple encryption layers, making it exceedingly difficult for anyone observing the network to track the data source or endpoint, thereby effectively hiding the identity and location of the user [4].

While Tor is frequently used by individuals and organizations desiring heightened privacy and security, its anonymity features have also attracted illegal activities, such as illicit marketplaces, hacking forums, and other criminal operations [5]. The Tor network has intrigued the scientific community since its inception, with efforts focused on improving its privacy and anonymity protection capabilities, as well as tracking illegal websites and online platforms that it hosts [6].

Given this interest and the complex nature of this anonymous network, researchers have developed software tools for efficient data collection and analysis [7]. However, these tools tend to serve a single purpose and lack the integration capabilities needed for complex analyses [8]. As a result, almost every new piece of research on the dark web involves designing and implementing software components from scratch, an overhead that does not contribute to the scientific and experimental work itself.

Moreover, the nature of the dark web introduces distinct particularities in data engineering [9]. Unlike the surface web, it operates primarily through overlay networks that are reached through special proxies [10]. The inherent network latency and accessibility challenges posed by the encrypted darknets call for delay-tolerant and parallelized designs [11]. Additionally, the content is notably dynamic and intermittent, with dark sites often altering their addresses to evade detection, requiring continuous, scalable monitoring to identify them before they become inaccessible [12]. In this context, robustness is crucial for data pipelines, which may suffer from elevated failure rates due to navigation and crawling interruptions. Given these unique challenges and considerations, our proposed framework offers modularity, generality, and the necessary features to navigate the dark web effectively.

This study introduces the design and implementation of a new, scalable framework compliant with dark web particularities, offering a solid architecture for easy integration of new workflows and use cases for dark web analysis. With the shared base architecture, it is possible to incorporate elements specific to the intended research without building basic components such as inter-component communication, data persistence, or log management for monitoring and robustness.

In addition to creating the foundational infrastructure, we designed and tested a use case for gathering Tor onion domains by web crawling both the surface web and the Tor network. This task is challenging due to the absence of a DNS system, the lack of official Tor link repositories, and the complexity of the 56-character random base32-encoded links used as web addresses [13]. This functionality is developed and tested within the proposed architecture.

2 Related work

This section examines a variety of software tools designed for analyzing and exploring the Tor network. Each tool has a unique purpose, as detailed in Table 1.

A set of tools is tailored specifically to crawl dark websites and scrutinize their content. The Dark Crawler [14] is a modification of a crawler aimed at locating and extracting data related to child exploitation, such as images, videos, keywords, and links on the Tor network. It employs distributed crawling via a seed list to automatically search Tor network websites and stores their HTML content and images. The crawled data can then be analyzed using different analytic tools. The Docker-based Tor crawler [15], designed with containerization and parallel Tor browsers, enhances the crawling performance of dark websites. It focuses on downloading HTML pages from Tor onion sites for further analysis. The Dark Web Threat Intelligence Analysis (DWTIA) Platform [16] comes equipped with modules for data acquisition, indexing, analysis, and visualization, making it ideal for investigators to examine data and information from the Dark Web. Meanwhile, the Analytical Framework for Dark Web Scraping and Analysis [17] is employed to scrutinize a specific underground marketplace. Its operation involves installing Tor, creating an AppleScript for scraping, running scripts, populating a database, and identifying suspects with Maltego. However, scripts must be customized for each targeted dark website.

Table 1 Comparative table of solutions for Tor network exploration and analysis

Specifically, there is a tendency to design methodologies for categorizing dark web resources with topic classification. For instance, the Black Widow Crawler [18] relies on a Docker-based microservice architecture which permits the combination of both preexisting and customized machine learning tools to categorize content. The deployment consists of planning (identify dark web forums and acquire access), collection (establish anonymous access and collect raw data), processing (parsing raw HTML and extracting topics), analysis (infer relationships and trends), and dissemination (dashboards, alerts and reports). The Machine-based Suspicious Service Detection And Labelling (MASSDEAL) [19], developed in Python 3.6, automates the exploration and categorization of the Tor network. The sequential operation comprises dark web content retrieval, filtering out duplicate or invalid onion services, and classifying the content according to different classes. Analogously, the Tor-oriented Web Mining Toolkit [20] is a modular architecture designed for data mining the dark web. It is built with a coordinator, a crawler, an extractor, and an analyzer. Its workflow involves visiting Tor network pages and classifying the content with a semantic engine.

The Dark Web crawling system [21] is a crawler capable of collecting, cleaning and saving dark web pages in a document database, automatically classifying the gathered sites into five classes. On the other hand, the Dark web monitoring methodology [22] offers an ordered set of coupled tasks for categorizing Tor sites. The main components comprise analyzing word occurrences, category classification, and visualization of the resulting plots. The Automated Tool for Onion Labeling (ATOL) [23] was designed for collecting and categorizing Tor onion services. It has a crawler, activated daily, that collects links from a predefined seed list using ElasticSearch and several extraction strategies built with Scrapy: a deep site crawler, registration forum crawlers and interactive forum spiders. Two-stage thematic labeling is performed by learning descriptive and discriminating keywords for different categories and using these terms to map onion site content to a set of thematic labels.

On a smaller scale, alternative tools like PASTOR [24] offer a microservice architecture and status-checking functionalities and primarily serve to collect Tor onion addresses and their advertisers. Finally, the MEMEX project [25] was launched by the United States Defense Advanced Research Projects Agency (DARPA) to implement a dark web search engine. Sophisticated features based on data mining and artificial intelligence were integrated to analyze images, datasets and text.

Despite the modularity, scalability, and potential for extended functionality in some of these tools, as shown in Table 1, none offer a framework to facilitate the deployment of new use cases and workflows. Instead, each tool is purpose-built for specific tasks, motivating the design of our new framework to allow easy adaptation for various custom functions.

3 Design of the framework

The proposed design for the framework is depicted in Fig. 1. The architecture is a modular design divided into three logical layers (control, logic and operations) and a module of tools. In addition, external management is done through a Representational State Transfer Application Programming Interface (REST API), communication within the system is asynchronous and message-based for stream processing, a global database persists the raw data and processed artifacts, and logs are compiled for debugging and monitoring purposes in case unexpected events occur during an execution. Orchestrated architectures offer an advantage in adaptability over monolithic architectures for analyzing the dark web. This modular approach allows for greater scalability and ease of maintenance, as components can be updated independently to keep pace with emerging threats and technologies. In contrast, monolithic architectures usually become cumbersome, inflexible and difficult to adapt to new requirements as their infrastructure grows.

Fig. 1 Layers and infrastructure elements of the framework

3.1 Control layer

The Controller is the sole component of this layer, serving as the central orchestrator of the entire framework. Its primary role involves the creation and management of the infrastructure. This includes establishing the interconnection network between the different architectural layers and setting up the individual elements of the framework.

Furthermore, among other responsibilities, the Controller handles the scaling of those platform elements specifically designed to be replicated. This component also serves all API requests, enabling communication with external entities. In particular, the supported actions are launching, scaling, pausing, resuming and terminating the different elements of the architecture.
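The paper does not specify the API routes, so the following sketch only illustrates how the supported actions might be invoked from Python; the endpoint paths, port and payload fields are assumptions.

    import requests

    API = "http://localhost:8000"  # hypothetical address of the Controller's REST API

    # Hypothetical routes for the supported actions (launch, scale, pause, resume, terminate).
    run = requests.post(f"{API}/executions/dark_web_html_scraping/launch",
                        json={"search_engines": ["Google", "Yahoo"], "depth": 1})
    exec_id = run.json().get("id")

    requests.post(f"{API}/operations/crawling_dark_web/scale", json={"replicas": 15})
    requests.post(f"{API}/executions/{exec_id}/pause")
    requests.post(f"{API}/executions/{exec_id}/resume")
    requests.post(f"{API}/executions/{exec_id}/terminate")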

3.2 Logic layer

Executions serve as the building blocks of the logic layer, bridging the gap between the control and operations layers.

In simple terms, an Execution is a specific workflow a user wishes to enact within the framework. For instance, this could be an algorithm designed to accomplish a particular task related to studying the dark web. Some examples of potential Executions might include the scraping of websites deployed on the dark web, collecting darknet links such as I2P or Tor domains through web crawling techniques, periodically monitoring the status of dark web sites, or classifying underground forums and marketplaces into different categories based on their content.

The term ‘Execution’ can be likened to a computer program, of which multiple instances can be initiated per user request. To complete their tasks, Executions must utilize the Operations defined within the framework’s final logical layer.

3.3 Operations layer

Operations are procedures designed to accept inputs, perform the necessary transformations or processing, and then return an element of either the same type as the input or a different one. Along these lines, if we consider Executions to be similar to computer programs, then Operations can be compared to the functions in a library that these programs leverage.

Each Operation should function independently and must not retain any state or information between calls. Furthermore, these Operations need to be replicable by the framework, providing a means to increase processing power if the user requires it. For example, various types of Operations might include conducting a port scan on an IP address to identify active ports, performing web crawling on the surface web from a specific URL to gather onion addresses, returning the results of a search engine query, or carrying out a steganalysis of an image.
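As an illustration of this contract, the following minimal sketch shows a stateless port-scan Operation in Python; the message fields and function name are hypothetical and not part of the framework's actual code.

    import socket
    from typing import Dict, List

    def port_scan_operation(message: Dict) -> Dict:
        """Stateless Operation sketch: takes an input message, performs the
        processing (here a TCP port scan) and returns a result message."""
        ip = message["content"]["ip"]
        open_ports: List[int] = []
        for port in message["content"].get("ports", [80, 443]):
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                sock.settimeout(2)
                if sock.connect_ex((ip, port)) == 0:  # 0 means the connection succeeded
                    open_ports.append(port)
        # No state is kept between calls, so the Operation can be freely replicated.
        return {"header": message.get("header", {}),
                "content": {"ip": ip, "open_ports": open_ports}}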

3.4 Tools module

Beyond the logical layers, the framework also offers the capability to incorporate additional components, referred to as Tools. Although they are not part of the base infrastructure, these architectural elements function like plugins and can be utilized by both Executions and Operations interchangeably.

For instance, Tools might include proxy servers that redirect network traffic for accessing darknets or auxiliary databases for temporary data storage of large files, which cannot be returned directly to Executions by Operations.

3.5 Underlying infrastructure

The fundamental strength of this design lies in its broad-ranging application potential, allowing for seamless integration of new workflows. If programmers wish to implement any feature, they simply need to develop a new Execution with the specific use case algorithm and integrate the corresponding Operations and Tools. Once completed, the user can deploy the framework and initiate multiple instances of this new workflow.

The separation of Operations from Executions, placing them in distinct logical layers, enables the reuse of Operations in novel and future workflows beyond their original intent. This flexibility is pivotal for the framework’s growth in versatility, interoperability, and functionality as more Operations and Tools are integrated. In this scenario, the infrastructure is equipped with universal functionalities designed to support the provisioning of the aforementioned Executions, Operations, and Tools.

3.5.1 Asynchronous message-based communications

Asynchronous message-based communications enhance the robustness of the framework. They allow interaction between different Executions, Operations, and Tools, even under heavy loads. This mechanism ensures that the system remains responsive and resilient without being blocked by any single operation.

3.5.2 Database

The database is a core component of the underlying infrastructure. It stores the necessary data and information used by the Executions, Operations, and Tools. The database design should be scalable and efficient to handle potentially large amounts of data and provide quick and accurate retrieval of data when required.

3.5.3 Logs management

Logs management plays a crucial role in maintaining system reliability and in troubleshooting. It helps in tracking the activities of Executions, Operations, and Tools, enabling real-time monitoring and alerting. This aspect of the framework provides valuable insights into the system’s performance and can significantly aid in diagnosing and addressing potential issues.

Fig. 2 Implementation of the framework

4 Implementation

As depicted in Fig. 2, the implementation of the framework is based on a microservice architecture, particularly opting for Docker Swarm as the base technology [26]. This decision maps directly onto the framework requirements for modularity, scalability, robustness, and versatility [27]. The decision to opt for Docker Swarm over Kubernetes was made to allow the framework to be executed on lower-resource machines, enabling its use in a wider range of environments compared to Kubernetes, which is predominantly used for complex and extensive deployments. In contrast to the latter, the operational simplicity of Docker Swarm also facilitates the modification and updating of the underlying infrastructure by individuals who may not have been directly involved in its initial development, thus enhancing its accessibility and maintainability.

The flexibility and scalability of a microservice architecture allow each Execution, Operation, and Tool to be developed, deployed, and scaled independently. This matches our modular design, where each component acts independently, enabling optimization of resource usage and performance. In addition, the isolated nature of services ensures fault tolerance, preventing failures in one service from affecting others.

The microservice architecture further aligns with our framework through technology heterogeneity. This characteristic allows each service to be developed in the most suitable programming language or technology stack, in line with the specific requirements. Moreover, in terms of maintenance and updates, the microservices can be updated individually, minimizing system downtime and enhancing the continuous integration of parallel workflows.

The following subsections describe the different implemented elements, which are programmed with different technologies and deployed as individual or replicated containers. We discuss the underlying base infrastructure and the set of Executions, Operations and Tools related to a particular dark web workflow.

4.1 Infrastructure implementation

Three essential open-source technologies are incorporated into the framework infrastructure to support any workflow to be integrated.

4.1.1 Asynchronous message-based communications

This framework manages communication between microservices through an asynchronous message-based approach that uses stream processing. Apache Kafka, the technology adopted, enables real-time processing and the decoupling of services and workloads through its topic publish-subscribe model [28]. Kafka topics are created by the Controller for the various communication channels and partitioned to facilitate parallel processing in replicated services. In this setup, microservices can act as publishers, consumers, or both, providing maximum flexibility and scalability. The Controller creates one Kafka topic per Operation and Tool and names it after the component that will use it. In the case of Executions, two Kafka topics are created: one to retrieve the results of the Operations and Tools used and a second one to receive messages from the Controller. These are named with the name of the Execution and its ID; the latter is set by the Controller in an environment variable when the Execution’s instance is created. To allow interoperable communications among Executions, Operations and Tools inside the framework, the messages are serialized in JSON format with the following structure:

  • Header: keeps the metadata of the message, i.e., the Kafka Topic of the last sender, the ID of the Execution that created this message and the type of data it contains, for example, a resource, an advertiser or an onion service.

  • Execution Field: maintains information between operation or tool calls. This field can only be modified by the executions.

  • Operation Header: indicates the specific parameters required to use a particular operation or tool.

  • Content: specifies the result, or the error, produced by the operation or tool that processed the message.

The Execution Field avoids the cost of tracking messages in an internal data structure of the Execution in order to build the definitive object destined for database storage. Finally, although most of the communication can be made through Kafka using these messages, some Operations or Tools may be reached through other types of communication. For example, as discussed later, the Tor Proxy Tool in our implementation is used in the same way as a normal web proxy.
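For illustration, a message following the four fields above could look as follows in Python before serialization; the concrete key names and example values are assumptions, since the structure is defined here only at a conceptual level.

    import json

    message = {
        "header": {
            "sender_topic": "crawling_surface_web",       # Kafka topic of the last sender
            "execution_id": "dark_web_html_scraping_42",  # Execution that created the message
            "data_type": "onion_service",                 # resource, advertiser or onion service
        },
        "execution": {                                    # state kept between Operation/Tool calls,
            "pending_domains": 17,                        # writable only by the Execution
        },
        "operation_header": {"depth": 1},                 # parameters for the target Operation/Tool
        "content": {"onion_domain": "<56-character base32 string>.onion"},
    }
    payload = json.dumps(message).encode("utf-8")         # serialized form sent through Kafka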

4.1.2 Database

PostgreSQL, a widely used open-source relational database, is deployed for its structured organization, ACID (Atomicity, Consistency, Isolation, Durability) transaction support, flexibility, and broad range of features [29]. This structure promotes efficient querying and ensures data integrity during high-volume operations. Its versatility accommodates different data types, and its open-source nature provides access to regular updates and enhancements. Although PostgreSQL is preferred because of its active community, which continually enhances its performance, security, and scalability, and its wide range of extensions and customizations to tailor the database to specific needs, the framework design allows for other database types based on specific user requirements. Finally, if PostgreSQL is used and new tables are needed, they can be specified inside a .sql file that the PostgreSQL service runs every time the framework is restarted.
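As a sketch of how a workflow-specific table could be provisioned, assuming hypothetical connection parameters and a hypothetical schema (in practice such statements would live in the .sql file executed when the PostgreSQL service starts):

    import psycopg2

    conn = psycopg2.connect(host="postgres", dbname="framework",
                            user="framework", password="secret")  # hypothetical credentials
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS onion_service (
                onion_domain  TEXT PRIMARY KEY,
                title         TEXT,
                language      TEXT,
                http_code     INTEGER,
                discovered_at TIMESTAMP NOT NULL
            );
        """)
    conn.close()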

4.1.3 Logs management

Given the distributed nature of the framework, an efficient log management system is crucial for tracking events and identifying potential errors during the deployment and execution phases. For this purpose, the ELK (Elasticsearch, Logstash, Kibana) Stack [30] is integrated into the infrastructure. This powerful toolset enables effective real-time monitoring, troubleshooting, and system maintenance.

In our implementation, Logstash enables centralized logging by collecting and processing logs from each microservice, Elasticsearch provides a scalable engine to analyze log data, and Kibana provides visualization capabilities, offering an intuitive interface for exploring and visualizing data in real time.

4.1.4 Implementation of a new component

To define a new Tool, Operation or Execution in the current implementation of the framework, developers must follow these steps:

  1. First, the name of the new component must be specified in a file containing the names of the other components of the framework, allowing the platform to deploy them when needed. In the case of Operations and Tools this will be when the platform is started, and in the case of Executions, at users’ request.

  2. After that, a folder must be included inside the corresponding directory (Executions, Operations or Tools) with the program files and a Dockerfile to create the container image. The new component may need a wrapper to process the messages of its Kafka topic into a valid input format and vice versa, in case it needs to send the message back with the results (a minimal wrapper sketch is given at the end of this subsection).

  3. Then, the new component must be defined inside one of the three Docker Swarm configuration files (Executions, Operations or Tools) as a new service, and configured to use the same overlay network and logging driver as the rest of the services.

  4. Finally, in case of implementing a new Execution, the initial parameters of the execution must be specified in another file to allow the API to deny any requests to start an execution with invalid arguments.

Regarding the configuration of Operations and Tools, if they can be configured individually for each request, this must be done using the Operation Header of the messages described above. However, if the component can only be configured at its initialization, its configuration must be included in the Dockerfile or in the service specification of the corresponding Docker Swarm configuration file.
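The wrapper mentioned in step 2 can be sketched as follows, assuming the kafka-python client, a hypothetical broker address, and the reply-topic convention described in Sect. 4.1.1; the processing function is a placeholder for the component's actual logic.

    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("my_new_operation",              # topic named after the component
                             bootstrap_servers="kafka:9092",
                             group_id="my_new_operation",
                             value_deserializer=lambda v: json.loads(v.decode("utf-8")))
    producer = KafkaProducer(bootstrap_servers="kafka:9092",
                             value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    def process(message):
        """Placeholder for the new component's actual logic."""
        return {**message, "content": {"echo": message.get("content")}}

    for record in consumer:
        message = record.value
        result = process(message)
        # Send the result back to the Results topic of the originating Execution
        # (here assumed to be identified by the execution_id field of the header).
        producer.send(message["header"]["execution_id"], result)
        producer.flush()

Because the component's topic is partitioned, several replicas of this wrapper can consume from it in parallel within the same consumer group.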

4.2 Workflow for web scraping of onion services

This section explores a specific use case related to the dark web. The primary goal here is to collect and extract data from HTML pages of Tor onion services. This data includes the page title, description, tags, and language. The process starts with a web crawl on the surface web. As onion services are identified, additional web crawling is undertaken through the Tor network to discover more onion domains.

To enable web scraping within our framework, we introduce a set of Operations and Tools. These form a catalog for this workflow, as well as for potential other dark web pipelines, as shown in Fig. 3. The Dark Web HTML Scraping Execution is a microservice that combines these Operations and Tools strategically to meet the specific web scraping objective.

The Controller, managed via the API, configures, launches, and interacts with the Execution through a Kafka topic named Control Dark Scraping. The Operations used can be replicated across any number of instances, with the associated Kafka topics divided into an equal number of partitions. The results of these independent, stateless Operations are then collected in the Results Dark Scraping channel, allowing the Execution to continue the workflow. This could involve populating or querying the database and publishing messages to continue invoking the Operations.

In practice, Fig. 3 depicts a streaming, distributed pipeline that is synchronized by the Execution through message brokers. This setup facilitates parallel and asynchronous processing across all elements, each of which can be replicated if needed. The outcome illustrates the modularity and scalability of the architecture in handling dark web workloads.

In the following, the different Operations, Tools and the workflow logic are detailed. Before these descriptions, it is important to define some terms:

  • Resource: Any online website that can be accessed, processed and parsed.

  • Onion service: Site available through the Tor network, which allows users to access it anonymously. The associated URL or domain is known as an onion address.

  • Advertiser: Resource containing at least one onion address.

Fig. 3 Integration of the Dark Web HTML Scraping use case in the framework

4.2.1 Supported operations

Four Operations are defined and integrated into the framework for the dark web scraping use case; they are also available to other workflows.

Exploration in search engines

Input:

String to query, list of search engines and number of resulting pages to retrieve.

Output:

List of resources.

Description:

This Operation submits the search query (such as “dark web links” or “ahmia”) to the different search engines to compile online resources matching the query, up to the specified number of result pages. The implementation has been carried out with the search_engines library.

Each resource message is packed with useful information about each located webpage. It contains the resource’s URL and title, the timestamp of its discovery, the resource’s format and the search query used.
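A possible sketch of this Operation is shown below; it assumes the search_engines package exposes per-engine classes with a search(query, pages) method and a links() accessor on the results, which may differ between versions of the library.

    from search_engines import Google, Yahoo  # assumed import path of the library

    def explore(query, engines=(Google, Yahoo), pages=10):
        """Query several search engines and return one resource dict per link found."""
        resources = []
        for engine_cls in engines:
            results = engine_cls().search(query, pages=pages)
            for url in results.links():
                resources.append({"url": url, "query": query, "engine": engine_cls.__name__})
        return resources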

Crawling in the Surface Web

Input:

A web address.

Output:

An advertiser and a list of onion services.

Description:

This Operation performs a DFS (Depth-First Search) web crawl, with a certain depth, from a given surface web address and compiles the onion addresses it finds. To carry out its tasks, it makes use of the Python framework Scrapy, together with the Python re module. This Operation returns two types of messages: advertiser type and onion service type.

The advertiser message is related to the received resource, which is now referred to as the root advertiser. It includes its web address, the HTTP code of the response and, if the connection is successful, the title of the page, its language, description, the tags of the page, the timestamp of when it was accessed, the format of the advertiser (in this case HTML), whether the advertiser is on the dark web or not, the crawled site that preceded this advertiser, the number of onion services contained in this particular advertiser, the depth of the DFS, and the number of onion services found at this depth.

The onion service messages are returned once per discovered onion address. Each message contains the onion domain found, the timestamp of when it was discovered, the depth from the root advertiser, and the advertiser in which it was found. Specifically, this last advertiser field is composed of the attributes of the advertiser message except for the last two (DFS depth and number of onion services), along with the web address of this advertiser, the timestamp of when it was discovered and its depth from the root advertiser.
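Since the Operation is built on Scrapy and the re module, a simplified spider might look as follows; the depth limit, item fields and spider name are illustrative, and only version 3 onion addresses (56 base32 characters) are matched.

    import re
    import scrapy

    ONION_RE = re.compile(r"\b[a-z2-7]{56}\.onion\b")  # v3 onion addresses

    class SurfaceOnionSpider(scrapy.Spider):
        """Simplified sketch of a depth-limited crawl that collects onion addresses."""
        name = "surface_onion_spider"
        custom_settings = {"DEPTH_LIMIT": 1}  # crawl depth, set per Execution

        def __init__(self, start_url, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.start_urls = [start_url]

        def parse(self, response):
            # One onion service item per address advertised on this page.
            for onion in set(ONION_RE.findall(response.text)):
                yield {"onion_domain": onion, "advertiser": response.url}
            # Follow in-page links; Scrapy crawls in depth-first order by default.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)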

Crawling in the Dark Web

Input:

A Tor address.

Output:

An advertiser and a list of onion services.

Description:

This Operation receives a Tor address and performs a DFS web crawl using the Tor network to establish the web connections, returning the same advertiser and onion service messages explained before. In the implementation, a Tor proxy, included in the Tools module and explained later, is additionally needed to connect to the Tor network.

Although both crawling Operations share the same underlying logic, this one can be considerably slower due to the building of Tor circuits for traffic routing, so they were kept separate with the intention of scaling this version to a higher degree.

Extraction of information

Input:

List of web and Tor addresses.

Output:

List of web items.

Description:

This Operation takes web addresses, both normal and onion domains, accesses them and returns the following information: the timestamp of when they were accessed, the HTTP code of the response and, if the connection was successful, the title, language, description and tags extracted from the received HTML page.
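A possible sketch of this Operation, assuming the metadata is read from standard HTML elements and using requests with BeautifulSoup (the paper does not state the exact libraries for this step):

    import datetime
    import requests
    from bs4 import BeautifulSoup

    def extract_info(url, proxies=None, timeout=30):
        """Access a (surface or onion) address and extract basic page metadata."""
        item = {"url": url, "accessed_at": datetime.datetime.utcnow().isoformat()}
        try:
            resp = requests.get(url, proxies=proxies, timeout=timeout)
        except requests.RequestException as exc:
            item["error"] = str(exc)
            return item
        item["http_code"] = resp.status_code
        if resp.ok:
            soup = BeautifulSoup(resp.text, "html.parser")
            item["title"] = soup.title.string.strip() if soup.title and soup.title.string else None
            desc = soup.find("meta", attrs={"name": "description"})
            item["description"] = desc.get("content") if desc else None
            keys = soup.find("meta", attrs={"name": "keywords"})
            item["tags"] = keys.get("content") if keys else None
            item["language"] = soup.html.get("lang") if soup.html else None
        return item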

4.2.2 Provided tools

In the Tools module, a Tor proxy has been integrated to enable access to the Tor network. This component implements the Tor protocol and deploys Privoxy, a web proxy server that manages traffic redirection between the surface and dark web.

This tool is used by Operations that require access to the Tor network. For instance, Crawling in Dark Web and Extraction of information need to use this element as middleware to connect to Tor onion sites. To use this Tool, Operations redirect their HTTP requests to it in the same way a normal web proxy is used.
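For illustration, routing a request through the Tool only requires pointing the HTTP client at the proxy; the service hostname is hypothetical and the port is Privoxy's default.

    import requests

    # Hypothetical service name; 8118 is Privoxy's default listening port.
    TOR_PROXY = {"http": "http://tor-proxy:8118", "https": "http://tor-proxy:8118"}

    # The onion address below is a placeholder; requests to it are relayed
    # through Privoxy and the Tor client exactly as with a normal web proxy.
    resp = requests.get("http://<56-character base32 string>.onion",
                        proxies=TOR_PROXY, timeout=60)
    print(resp.status_code)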

Fig. 4 Workflow of the implemented Execution (web scraping of onion services)

4.2.3 Implemented execution

The logic of the presented use case is implemented in a particular Execution, called Dark web HTML scraping in Fig. 3, and its workflow can be seen in the flowchart of Fig. 4. The initial parameters of this Execution are i) a string list of the search queries, ii) a string list with the set of search engines to employ, iii) the number of result pages to analyze from the chosen search engines, iv) the depth of both DFS crawls (on the surface web and the Tor network), and v) a boolean to indicate the use of Tor circuits for surface sites. Briefly, the workflow proceeds as follows:

  1. Resource gathering

    To initialize the Execution, the search queries and the search engines to be used must be established. These parameters are then sent to the Operation Exploration in search engines, which collects all the resources associated with them and, when finished, sends the resulting resource messages back to the Execution.

  2. Crawling the surface web

    The resource messages received by the Execution are handled and sent to the Crawling in Surface Web Operation to start the crawling process on the surface web (2.1). After this, depending on the type of result message received by the Execution, a different logic is applied. If the root advertiser message (2.2) arrives and states that onion services have been found, its information is stored in the database. Additionally, the relation between the root advertiser and its predecessor is also stored in the database for future analyses.

  3. Extracting information

    Suppose the received message is an onion service message. If this onion service has not been found previously by this Execution’s instance, or if it has not been accessed within the last 8 h, it is sent to the Extraction of information Operation (3.1). On the other hand, if the onion service was already discovered and recently accessed, the information about its advertiser and the advertiser’s predecessor is stored (3.2). The decision to periodically re-access an onion service, even if it has been accessed before, is motivated by the intention to examine their lifetime. These services are typically volatile in nature, and this approach allows for future study and analysis.

  4. Crawling the Tor network

    Once the information of the onion service is collected and the onion service message is back in the Execution, it is stored in the database together with the information of its advertiser and the advertiser’s predecessor.

After this, if the onion service is reachable and has not already been stored in the database as an advertiser, it is sent to the Crawling in Dark Web Operation to be crawled, this time using the Tor network. Subsequently, the Execution enters a loop, going back to step 3. It is important to emphasize that the check condition of this fourth step prevents the Execution algorithm from falling into an infinite loop.

Due to the complexity of tracking the exponentially increasing number of result messages, the Execution is considered terminated when it does not receive any more messages during a predetermined time interval. Another way to terminate the Execution is to send it an end control message.
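The revisit and termination checks of steps 3 and 4 can be summarized in the following sketch; the helper names, the idle interval and the message layout are assumptions, and only the 8-hour threshold comes from the text.

    import time

    REVISIT_THRESHOLD = 8 * 3600   # re-access onion services last visited more than 8 h ago
    IDLE_TIMEOUT = 30 * 60         # hypothetical interval without messages that ends the Execution

    def should_extract(onion, last_access, now=None):
        """Step 3: extract info for new onions or onions not accessed within the threshold."""
        now = now if now is not None else time.time()
        return onion not in last_access or now - last_access[onion] > REVISIT_THRESHOLD

    def run(consume, handle_message):
        """Main loop: stops after IDLE_TIMEOUT without messages or on an end control message."""
        last_message = time.time()
        while time.time() - last_message < IDLE_TIMEOUT:
            message = consume(timeout=60)        # e.g. poll the Results Kafka topic
            if message is None:
                continue
            last_message = time.time()
            if message.get("type") == "end":     # end control message from the Controller
                break
            handle_message(message)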

5 Experimentation

To test the capabilities of the proposed solution in different configuration scenarios, four experiments have been conducted making use of the Execution currently implemented.

5.1 Experiments

The four experiments differ in two factors: whether the number of replicas of the framework’s services is scaled up before deploying the Execution instance, and the depth used for both DFS Operations; the rest of the parameters remained the same for all tests.

Table 2 shows the four experiments’ configurations. In experiment one the framework has been scaled up and the depth of both DFS Operations (Sc/D1) has been set to one; in experiment two the framework has not been scaled and, again, the depth of the DFS Operations has been set to one (No Sc/D1); experiments three and four are analogous cases but using depth zero for both DFS Operations (Sc/D0 and No Sc/D0, respectively).

Table 2 Experiments performed

These configurations have been selected to study the impact on the machine’s resource consumption when the framework is scaled up, along with the effect that the DFS depth has on the gathering of onion services. This helps build an intuition about which configurations are best to use, in addition to supporting the use of the proposed architecture when dealing with tasks related to the dark web.

5.2 Experiments configuration

The experiments have been carried out on a physical server with the following specifications: 30 CPUs of 2300 MHz (1 core/CPU), 78 GB of memory and a 10 Gbit/s network interface.

The experiments have been conducted sequentially. For each one, the framework has been running for 16 h with only one instance of the Execution and with the initial parameters shown in Table 3. The initial query used with the chosen search engines, in this case Google and Yahoo, will not be revealed to avoid promoting access to illicit content, and the number of result pages to analyze has been set to 10. In addition, as commented above, the depth for both DFS crawls, outside and inside Tor, has been set to zero or one, depending on the experiment, and the Tor network has not been used to establish connections to sites outside the dark web.

Table 3 Initial parameters of the four Execution instances

Finally, in the experiments where the framework has been scaled up, the number of replicas of the different services has been: 15 replicas of the Operations Crawling in Dark Web and Extraction of information, and 10 replicas of the Operation Crawling in Surface Web. In the other two tests, the number of replicas of these Operations is set to one.

Fig. 5 Accumulated unique onion domains discovered over 16 h by each experiment

5.3 Performance results

From this point forward we will use the notation [\(x_1; x_2; x_3; x_4\)] to denote the results of the experiments Sc/D1, No Sc/D1, Sc/D0 and No Sc/D0 respectively.

During a 16-hour period, the Operation Crawling in Surface Web identified [692; 650; 634; 581] onion domains ([216; 193; 216; 198] unique ones) in the surface web. In contrast, the Operation Crawling in Dark Web recorded [504,664; 34,478; 440,191; 109,725] onion domains ([84,312; 12,138; 74,186; 12,719] unique ones) in Tor onion services. Lastly, the number of distinct onion domains shared by both Operations amounts to [157; 136; 162; 140]. In this sense, the total sum of unique onion domains is [84,371; 12,195; 74,240; 12,777].

Figure 5 displays an accumulative graph depicting the number of unique onion domains found over time, with the vertical axis representing the total count and the horizontal axis representing the number of 10-minute intervals that have passed. Notably, approximately [1.3; 3.3; 2.5; 10] hours are needed to compile more than [80K; 10K; 70K; 10K] different onion domains.

Fig. 6 Box plot of the seconds needed by the Operations Crawling in Surface web and Crawling in Dark web to establish a web connection, in the experiment scenarios with depth zero. The mean value is represented by an ‘X’

Throughout the data collection process, a total of [78,555; 6,617; 68,772; 6,769] accesses were made to onion services, which is a considerable throughput considering the time-consuming task of connecting to the Tor network. Figure 6 shows a box plot of the distribution of the seconds needed to establish a web connection (failed web connections have not been included) for the Operations Crawling in Surface web and Crawling in Dark web. As can be seen, Tor connections tend to have a larger variation in their response time and, on average, need 1.5 s to be established, whereas normal web connections can be established in about one second on average, a ratio of roughly 3:2 between Tor and surface web connection times.

Fig. 7 Latency analysis of the average number of seconds needed by the Operations Crawling in Surface web and Crawling in Dark web to process a request, in the experiment scenarios with depth zero

This fact also affects the overall time needed for these Operations to process a request, as shown in Fig. 7. A remarkable aspect of the chart is the difference between the two Operations in the average time needed for parsing the HTML content and for processing and sending the results back to the Execution, given that, as commented previously, the Operation Crawling in Dark web far exceeds the Operation Crawling in Surface web in the number of onion domains found. However, this is a consequence of the fact that most onion services tend to advertise far fewer onion domains, or none at all, in comparison with the advertisers collected on the surface web. As a result, the onion services that are the major contributors to Fig. 5 are treated as outliers and thus discarded from these calculations.

The number of successful connections that received an HTTP 200 status code was [76,657; 6,233; 66,555; 5,882] ([97.58%; 94.2%; 96.78%; 86.9%]). In the remaining accesses, [80; 13; 59; 18] requests ([0.1%; 0.2%; 0.08%; 0.26%]) returned 4xx status codes (client errors) and [1,818; 371; 2,158; 869] visits ([2.31%; 5.61%; 3.14%; 12.84%]) raised 5xx status codes (server errors).

The results presented throughout this subsection indicate that scaling the framework and taking advantage of the resulting parallelism leads to a substantial difference between each pair of experiments with the same DFS depth, in terms of the quantity and speed of accessing and gathering onion domains in short periods of time. On the other hand, when comparing the results along the other dimension, the DFS depth, although its effect in experiments 1 and 3 produces a quantitative difference of almost 10K onion domains, experiments 2 and 4 show that, in the short term, increasing this parameter without exploiting any kind of parallelism may not yield a considerable difference in the final results.

Finally, it is worth considering which onion domains have been the most advertised in the different resources crawled by the framework. Figure 8 shows a list of 20 onion domains, corresponding to the ten onion domains with the most distinct advertisers in our database found by each experiment, ranging from 50 to almost 6,500 occurrences. The content associated with the top eleven onion domains of the chart corresponds, from top to bottom, to: three dark web escrow services; two services that were down during these experiments and could not be identified; four lists composed of fraudulent Tor domains; a web service related to the distribution of an uncensored version of the Bible for the darknet; and a blog. Unfortunately, all of the HTML pages associated with the last nine onion domains share the same title and tags, related to the distribution of child abuse content. We hypothesize that they could be mirror sites of the same web service, but no manual inspection will be performed to confirm this due to the sensitive nature of the content.

Fig. 8 List of 20 onion domains, corresponding to the top 10 onion domains of each experiment with the most occurrences in different advertisers. Part of each address has been obfuscated to avoid promoting access to illicit content

5.4 Resource consumption

Finally, a resource consumption analysis has also been conducted to evaluate the impact of the different experiments. Figures 9 and 10 show, in addition to the base consumption of the server, the average percentage of memory and CPU used by each of the four experiments.

The percentage of memory consumption in Fig. 9 caused by the different experiments ranges from [18.32%; 14.7%; 17.75%; 14.49%] up to [24.39%; 17.52%; 24.15%; 17.94%], i.e., between [14.28; 11.46; 13.84; 11.3] and [19.02; 13.66; 18.83; 13.99] GB of memory. These results are consistent with the results of the previous subsection. The incremental memory consumption shown in Fig. 9 is connected to the high volume of communication messages, especially in experiments 1 and 3, that the framework generates and which are logged for debugging purposes. The memory drop at the end of experiment 1 relates to the garbage collector invoked by the Logstash service to free the heap memory of the JVM (Java Virtual Machine).

With respect to CPU consumption (Fig. 10), it varies from [1.53%; 0.167%; 1.37%; 0.233%] up to [16%; 7.1%; 16.1%; 11%] of the total capacity of the server. The data is again consistent with the comments above. The anomalous peaks that can be observed in the four experiments correspond to the discovery of the large advertisers discussed in the previous subsection. This is supported by the coincidence in time between these peaks and the growths observable in Fig. 5.

As shown, these results support the use of a microservice architecture when several tasks of this nature are to be replicated, since it facilitates the distribution of the workload among the various nodes that may be part of the cluster.

Fig. 9 Memory consumption of the four experiments

Fig. 10 CPU consumption of the four experiments

6 Conclusion and future work

In this paper, we have proposed a general and scalable framework that enables easy integration of new workflows related to the analysis of the Tor network thanks to a modular design with a control layer, a logic layer with the different use cases (Executions), an operations layer (catalog of Operations available) and a tools module (list of plugins to provide specific features). A controller is responsible for managing the infrastructure, which is built on top of asynchronous message-based communications, monitored through log ingestion and management and persisted in a relational database. This design has been mapped to a microservice architecture with Docker Swarm, integrating Apache Kafka as a message broker with a few topics for connecting components, ELK (ElasticSearch, Logstash and Kibana) stack for logs management, and PostgreSQL for data storage.

The underlying infrastructure has been used to implement a workflow for scraping web elements from Tor onion services. This use case required the design and implementation of one Execution with the logic of the workflow, four Operations as functions that are invoked asynchronously (Exploration in search engines, Crawling in Surface web, Crawling in Dark web and Extraction of information), and a Tor proxy as a Tool for the connections to the Tor network. The resulting instance of the framework is a streaming pipeline that continuously collects Tor onion addresses, titles and keywords, among other details, by executing the pool of Operations until the expected end.

The framework has been tested in a real use case related to the scraping of web elements of Tor onion services, proving its efficiency and capabilities. First, the scalability is tested with the deployment of 15 replicas of the Crawling in Dark Web and Extraction of information Operations, and 10 replicas of the Crawling in Surface Web Operation. In the short window of 16 h, the framework has been able to collect, in the best case, over half a million onion domains, of which 84,371 were unique, making a total of 78,555 accesses through the Tor network, 76,657 (97.58%) of them successful. These results evidence a strong performance, confirming that a microservice architecture with the integration of message brokers and horizontal replication is an effective approach for the costly task of analyzing the dark web.

Nonetheless, although the framework has proven fruitful, some improvements could be made to enhance its current capabilities and implementation. At present, the framework does not allow the integration of new Executions, Operations or Tools on the fly, that is, while it is running. Having to restart the framework every time a new component is developed and ready to test can burden the final user. Therefore, changing the current implementation to incorporate this functionality is a key step that will improve the usability of the framework.

Another major enhancement is the implementation of mechanisms that allow Operations and Tools to scale down without causing information loss in the Executions that may be using them, in case a reduction in processing power is required by the user.

This framework could be highly useful for researchers interested in developing new analytical tools. If these are designed from scratch to be integrated into the proposed framework instead of being developed in isolation from each other, researchers will be able to leverage the ready-to-use features of message brokers, log management, data storage, Operations and Tools, focusing their valuable effort on assembling new tailored workflows for dark web investigations and extending the set of interoperable Executions, Operations and Tools.

Finally, beyond the technical advancements and potential improvements, an ethical dimension arises when discussing the dissemination of our framework. While releasing it to the public can foster collaboration and rapid advancements, it also poses risks, especially given the sensitive nature of the dark web. Releasing such a tool to the public might inadvertently aid malicious actors. Therefore, it is worth contemplating a controlled release strategy. One approach could be to make the framework available only to vetted researchers, academic institutions and law enforcement agencies. This ensures that the tool is used responsibly and for its intended purpose: to provide a powerful infrastructure for developing and launching lawful investigations. Balancing transparency with security and ethical considerations will be paramount as we decide on the future accessibility of our framework.