The Agent Web Model -- Modelling web hacking for reinforcement learning

Website hacking is a frequent attack type used by malicious actors to obtain confidential information, modify the integrity of web pages, or make websites unavailable. The tools used by attackers are becoming more and more automated and sophisticated, and malicious machine learning agents seem to be the next development in this line. In order to provide ethical hackers with similar tools, and to understand the impact and the limitations of artificial agents, we present in this paper a model that formalizes web hacking tasks for reinforcement learning agents. Our model, named Agent Web Model, considers web hacking as a capture-the-flag style challenge, and it defines reinforcement learning problems at seven different levels of abstraction. We discuss the complexity of these problems in terms of the actions and states an agent has to deal with, and we show that such a model can represent most of the relevant web vulnerabilities. Aware that the driver of advances in reinforcement learning is the availability of standardized challenges, we provide an implementation for the first three abstraction layers, in the hope that the community will consider these challenges in order to develop intelligent web hacking agents.


Introduction
As the complexity of computer systems and networks has increased significantly during the last decades, the number of vulnerabilities inside a system has risen similarly. Different types of attackers may try to exploit these varying vulnerabilities for their own benefit. Websites are especially of interest to malicious actors, so attacks against websites are nowadays an everyday event. In order to protect vulnerable systems, one of the best approaches is to emulate real attacks using the same methodology that hackers would use. This practice, named white hat hacking, has become a crucial part of critical information technology projects. When taking part in a white hat hacking project aimed at testing the security of a target website, ethical hackers attack the system and report all their findings to the system owner or administrator so that the vulnerabilities can be patched. Ethical hacking is normally a human job, since the attacker needs a high level of expertise in penetration testing, which involves typically human capabilities (such as experience, reasoning, or intuition) that are hard to codify.
Although full automation of penetration testing is very challenging, hackers rely on a range of automatic tools [10] [25] [2] to help them deal with the number and the variety of possible vulnerabilities. In the case of web testing, there are many web security scanners that can help the work of a human tester. These tools can use predefined requests to check the existence of a vulnerability, and quickly generate security reports; however, they have limited capability to carry out complex evaluations, and their findings must normally be reviewed by a human supervisor. Indexes of quality, such as the number of false positives and false negatives, highlight the limited coverage of these tools. New vulnerability detection scripts and general updates may be deployed to improve the performance of web vulnerability scanners, but these are usually one-time solutions; automatic improvements, relying, for instance, on learning from previous cases, are lacking. Furthermore, many web scanners are designed only to detect vulnerabilities, but not to exploit them. Specific tools [7] can be used to exploit targeted vulnerabilities, with a moderate chance of success, and thus further the understanding of the overall security of the system under study.
Machine learning (ML) techniques aimed at solving problems through learning and inference are now being adopted in many fields, including security [32]. Following their success in challenging tasks like image recognition [19] or natural language processing [34], supervised deep neural network models have been adopted to tackle security-related problems in a static context, such as program vulnerability detection [26] or malicious domain name detection [20]. However, deep neural networks designed to solve static problems by exploiting large data sets of examples do not fit the more complex and dynamic problem of penetration testing. A sub-field of ML that may offer a more relevant paradigm to tackle problems such as web testing is reinforcement learning. Indeed, reinforcement learning methods allow an agent to learn by itself in a dynamic and complex environment by trial-and-error and inference. Success on challenging games like Go [30] or Starcraft II [35] suggests that these algorithms may soon find use in the world of penetration testing. Recently, some applications of ML and reinforcement learning in the context of offensive security were developed: on the side of white hat hackers, DARPA organized in 2016 the Cyber Grand Challenge for automated penetration testing [11]; on the side of black hat hackers, malicious bots are being provided with more learning functionalities.
Given the impact that artificial agents will have in the landscape of security, this paper aims at promoting research in this direction by proposing a modelling of penetration testing problems that may be used to train reinforcement learning agents. Our modelling effort follows two directions: we first examine the formalization of web hacking problems using standard models, and we then discuss abstractions of concrete instances of web hacking problems within our model. We call our generic model the Agent Web Model. Aware that a strong and effective driver for the development of new and successful reinforcement learning agents is the availability of standardized challenges and benchmarks, we use our formalization to implement a series of challenges at different levels of abstraction and with increasing complexity. We make these challenges available following the standards of the field. Our hope is that these challenges will promote and advance research in the development of automatic red bots that may help in the tasks of penetration testing.
This paper is organized as follows. Section 2 presents the main concepts related to web hacking and reinforcement learning. Section 3 discusses how the generic problem of web hacking may be reduced, through a set of formalization steps, to a reinforcement learning problem. Section 4 describes our own model for web hacking problems and describes instances of problems at different levels of abstraction. Section 5 explains how real-world hacking problems may be mapped onto the Agent Web Model. Section 6 provides some details on the implementation of challenges based on our formalization. Finally, Section 7 discusses some ethical considerations about this work, and Section 8 draws conclusions and illustrates possible directions for future work.

Web hacking
The most famous and popular Internet service, the World Wide Web (WWW), has been running for many years [3]. Since its invention in 1989 it has undergone many developments, and nowadays it is one of the most complex services on the Internet. The HTTP protocol [9] used by these web services was created for communication within a client-server model. The web client, typically a web browser, sends a HTTP request to a webserver; the webserver, in turn, answers with a HTTP response. A HTTP message consists of three main parts: the Uniform Resource Locator (URL), the HTTP header, and the HTTP body. The URL references the requested object. The header contains information on the state of the communication. The request header sent by a client specifies the web method (i.e., what to do with the object), client-related information (e.g., the type of the web browser), and cookie values referring to previous states of the communication. The answer header sent by a server contains the answer code to the request (e.g., file not found) and information related to the state of the communication (e.g., new cookie values with session variables). The body part of the HTTP message contains the payload of the communication. The request body may contain POST parameters sent by the client. The answer body usually contains the longest part of the message, that is, the web page content in Hypertext Markup Language (HTML) format.
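To make the parts of a request concrete, the following minimal sketch splits a raw HTTP request into its method and URL, its header fields, and its body. The request text, the host name, the credentials, and the helper name `split_http_request` are illustrative assumptions, not taken from a real system.

```python
def split_http_request(raw: str) -> dict:
    """Split a raw HTTP/1.1 request into method, URL, headers and body."""
    # The header section and the body are separated by an empty line.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # The first line carries the web method and the requested URL.
    method, url, version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {"method": method, "url": url, "version": version,
            "headers": headers, "body": body}


# Hypothetical login request: POST parameters travel in the body,
# while the session cookie travels in the header.
RAW_REQUEST = (
    "POST /login.php HTTP/1.1\r\n"
    "Host: example.org\r\n"
    "User-Agent: demo-client\r\n"
    "Cookie: JSESSIONID=Abvhj67\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "Content-Length: 22\r\n"
    "\r\n"
    "user=alice&pass=secret"
)

parsed = split_http_request(RAW_REQUEST)
print(parsed["method"], parsed["url"])
print(parsed["headers"]["cookie"])
print(parsed["body"])
```

The sketch ignores details such as repeated header fields and chunked bodies; it is only meant to show where the method, the cookie values, and the POST parameters sit in a message.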
Web communication is well defined by the HTTP standard. In time, due to the high number of components participating in the web communication, the web protocol has become increasingly complex, opening room to different vulnerabilities [37]. On the client side, a minimal web client can be easily realized by instantiating a TCP connection and by sending HTTP requests via command line. However, to enjoy the rich functionalities provided by HTML, including the latest standard, HTML5, web browsers are normally used. A web browser is an application providing a graphical interface that shows a HTML page with all its components. A HTML page may also contain code in the form of client-side scripts (such as JavaScript or scripts in other embedded objects), which can be executed locally by a web browser. Any unintended or malicious client-side script can have serious consequences during the web communication. If an attacker can sniff web traffic without encryption, or if the message can be decrypted, then attackers can set up man-in-the-middle exploitations. On the server side, the webserver runs on a physical or virtual computer, which may expose non-HTTP-related vulnerabilities at the level of the operating system, at the level of the applications running the webserver (e.g., Apache, Nginx, IIS) or by exposing other vulnerable services on the web. With respect to the HTTP protocol, one of the most significant, and potentially vulnerable, parts of website access is the server-side scripting engine. Server-side scripting makes it possible for the server to accept input sent by the client in order to customize its web answer. Based on the input, a server-side script can create connections to other resources such as local files or database records. Many components, such as Content Management Systems (CMS), provide ready-made modules for different functionalities using some server-side scripts. Having a vulnerability in a CMS can expose millions of websites that run the same vulnerable module.
Using the web protocol can thus expose several weak points that can be targeted by malicious actors. The type of the attacks can vary, but they can be categorized according to the information security triplet. Several attacks aim to break confidentiality by accessing sensitive or confidential information; in such attacks, the attacker may be able to find hidden objects such as files or database data, or she may manage to escalate her privileges in order to access protected data. In other cases, object integrity is targeted, either to cause damage and annoyance or as a preparatory step before carrying out further action; for instance, an attacker may upload a command script to the website (changing the integrity of the site) and use it to further her attack with more options. The third type of attack addresses the availability of the service; overloading a web service with many requests can cause a denial of service (DoS).

Capture the Flag
A Capture The Flag challenge (CTF) is a competition designed to offer ethical hackers a platform to learn about penetration testing and train their skills [21]. CTFs are organized as a set of well-formalized and well-defined hacking challenges. Each challenge has one exploitable vulnerability (or, sometimes, a chain of vulnerabilities) and an unambiguous victory condition in the form of a flag, that is, a token that proves whether the challenge was solved or not. Usually, CTFs require purely logical and technical skills, and they exclude reliance on side channels such as social engineering; moreover, challenges are normally designed to make the use of brute-forcing or automatic tools unfeasible.
The standard setup of a CTF is the so-called Jeopardy mode, in which all players target a single static system. More realistic setups may include the deployment of non-static services with evolving vulnerabilities, or the partition of players into teams, usually a red team, tasked with retrieving flags from the target system, and a blue team, responsible for preventing the attackers from obtaining the flags.
In the case of web challenges, a standard CTF consists of a website hosting objects with different vulnerabilities, and containing flags in the form of special strings. Participants are required simply to collect the flag, and no further exploitative actions are required (such as setting up a command and control system). Jeopardy-style web CTFs constitute collections of rigorous challenges: the environment in which to operate is well-defined, actions can take place only in the digital domain, and objectives and victory conditions are clearly stated. All these properties make CTFs interesting case studies to develop artificial agents for penetration testing.

Reinforcement Learning
Reinforcement learning (RL) is a sub-field of machine learning focused on the training of agents in a given environment [33]. Within such an environment, agents are given the possibility to choose actions from a finite set of available actions; upon undertaking an action, they can observe its consequences, both in terms of the effect on the environment, and in terms of a reward signal that specifies how good or desirable the outcome of that action is. The aim of RL is to define algorithms that allow an agent to develop an action policy leading to as high a reward as possible over time.
The RL problem may be particularly challenging, as the space of actions for the agent may be large, the environment may be stochastic and non-stationary, and the reward signal may be sparse. However, despite these difficulties, RL has proved successful in tackling a wide range of problems, such as mastering games [22,30] or driving autonomous vehicles [28]. The ability to learn in complex learning environments, such as Starcraft II [35], mirrors the sort of learning that a web hacking agent is expected to perform. RL algorithms may then offer a way to train artificial agents able to carry out meaningful penetration testing.

Related Work
Interest in training artificial red bots able to compete in a CTF challenge was heightened after DARPA organized a Cyber Grand Challenge Event in 2016 in Las Vegas [11]. In this simplified CTF-like contest, artificial agents were given the possibility to interact with a system exposing a limited number of commands.
However, interest in the problem of modelling and solving hacking or penetration problems predates this event. Different formalizations of CTF-like problems or penetration testing have been suggested in the literature. Standard models relied on formalisms from graph theory (e.g., Markov decision processes [27]), planning (e.g., classical planning [5]), or game theory (e.g., Stackelberg games [31]); a wide spectrum of models with varying degrees of uncertainty and varying degrees of structure in the action space is presented in [14].
Model-free approaches in which the agent is provided with minimal information about the structure of the problem have been recently considered through the adoption of RL [12,8,23,24]. While these works focus on the application of RL to solve specific challenges, in this paper we analyze the problem of how to define in a versatile and consistent way relevant CTF problems for RL. Notice that, in parallel to this work, some of the problems presented in this paper have already been analyzed and solved with simple RL algorithms in [38]. This paper, however, reconsiders particular instances of the problems tackled in [38] in a wider and more formalized perspective, presenting them within a layered framework of levels of abstraction.

Formalization of Web Hacking
In this section, we explore how the ill-defined problem of web hacking may be formalized using different types of standard models (Web hacking → CTF → game → RL problem).

From web hacking to CTF
As discussed in Section 2, real-world web hacking is an extremely complex problem, with vague success conditions and presenting a wide array of possible courses of action, ranging from the exploitation of publicly known vulnerabilities to reliance on non-technical side-channels like social engineering.
CTF challenges offer a clear, and yet realistic, way to specify web hacking problems. There are two important advantages in the modelling of web hacking as a CTF: (i) CTF challenges have a well-defined objective and unambiguous termination conditions (either in terms of flag retrieval or time expiration); and (ii) CTF challenges define an initial restriction on the actions that can be undertaken by a participant (normally requiring all attempts and attacks to take place in the digital domain).
In this sense we can understand CTFs as a first step in the formalization of web hacking. However, this formalization is still too loose to be useful for machine learning; most importantly, the space of actions, while being implicitly defined, is still too unconstrained to be useful.

From CTF to game
To further our modelling, we can express CTFs in game-theoretic terms. Web hacking CTFs can be defined as a game $G = \langle P, A, u \rangle$, where:
- $P$ is a set of players,
- $A$ is a set of actions available to players,
- $u$ is a vector of utility or payoff functions, such that $u_i$ is the utility function for player $i$.
The simplest instance of CTF is a 2-player game with $|P| = 2$, where one player is the attacker and the second player is the webserver. As long as the web CTF challenge is static, the webserver may be conceived as a player deterministically reacting to the actions of the attacker. As explained in Section 2.2, this basic CTF setup may be extended to adversarial multiplayer games with $|P| = N$, where players are partitioned in a red team and a blue team. In the following, we will focus our attention and our discussion on the 2-player game, although our considerations apply straightforwardly to the multiplayer case.
For any player, we assume the set of actions $A$ to be finite or countable, so as to allow an artificial agent to select its actions. Notice that this assumption of finiteness or countability is reasonable as long as a CTF takes place in a digital and discrete domain.
The utility function $u_i$ of a player allows for the encoding of the victory condition expressed by a CTF challenge. A stark binary utility function assigns a positive utility to the capture of the flag, and a null utility to everything else. More refined utility functions may shape the behaviour of a learned agent more subtly.
A game-theoretic formalization can then be seen as a further step in the process of formalization of web hacking problems. The main contribution of this form of modelling, contrasted with a generic CTF model, is the definition of an enumerable set $A$ of possible actions. This provides the foundation for an agent to choose actions and learn its own action policy. Although game theory already provides tools to analyze web hacking as we have modeled it, this formalization is still not ideal, as modeling the webserver as an active player is overly generic. In the case of interest, in which we have a single attacker targeting a static system, it would be more practical to describe the webserver as a static component of the game.

From game to RL problem
In the case of web hacking with a static system, the game-theoretic modelling over-defines the webserver by describing it as a player. Alternatively, we can model the game as a RL problem $\langle S, A, T, R \rangle$, where:
- $S$ is a set of states the game may be in,
- $A$ is a set of actions,
- $T : S \times A \to S$ is a state transition function defining how states evolve given an initial state and an action,
- $R : S \times A \to \mathbb{R}$ is a reward function defining the reward obtained by an agent after taking an action in a given state.
As in the game-theoretic formulation, the set of actions $A$ is a countable set of actions available to the agent.
The reward function $R$ translates the utility function $u$ from the game-theoretic modelling to the RL formalism.
Finally, the set of states $S$ allows for the tracking of the state of the game. Notice that, although the state of the game is uniquely determined at any point in time, the agent may not be aware of it. This leads to a partially observable game, in which the agent has no certain knowledge about the current state of the system, but has only a belief over the possible states. Through its own local state, which encodes its imperfect knowledge, the agent tries to keep track of the actual state of the system. Notice that a completely observable game may be considered as a limit case in which all the beliefs collapse into delta functions.
This final RL formalization captures well enough the problem of web hacking: it is flexible enough to accommodate very different hacking challenges, but, at the same time, it is constrained enough that all its components are well-defined so that standard RL algorithms may be used to train artificial agents. We will then make the RL formalization the theoretical blueprint of our model for web hacking.
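As a minimal sketch of this formalization, the tuple of states, actions, transition function and reward function can be written down for a toy two-file website. The container name `RLProblem`, the file names, and the choice of taking the state to be simply the set of files discovered so far are assumptions made for illustration; the binary reward mirrors the stark utility function discussed for the game-theoretic model.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

State = FrozenSet[str]      # here: the set of files discovered so far
Action = Tuple[str, str]    # (action name, target file)


@dataclass(frozen=True)
class RLProblem:
    """Container for the actions, the transition T and the reward R."""
    actions: Tuple[Action, ...]
    transition: Callable[[State, Action], State]
    reward: Callable[[State, Action], float]


# Hypothetical two-file website: index.html links to secret.html,
# and secret.html holds the flag.
LINKS = {"index.html": {"secret.html"}, "secret.html": set()}
FLAG_FILE = "secret.html"


def transition(state: State, action: Action) -> State:
    name, target = action
    # Reading a known file reveals the files it links to.
    if name == "read" and target in state:
        return state | frozenset(LINKS[target])
    return state


def reward(state: State, action: Action) -> float:
    name, target = action
    # Binary reward: 1 only when the flag is found in a known file.
    found = name == "search" and target in state and target == FLAG_FILE
    return 1.0 if found else 0.0


problem = RLProblem(
    actions=tuple((n, f) for n in ("read", "search") for f in LINKS),
    transition=transition,
    reward=reward,
)

state: State = frozenset({"index.html"})
total = 0.0
for action in [("search", "secret.html"), ("read", "index.html"),
               ("search", "secret.html")]:
    total += problem.reward(state, action)
    state = problem.transition(state, action)
print(sorted(state), total)
```

The first search() earns nothing because the target file has not been discovered yet; only after read() has extended the state does the same action collect the reward.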

The Agent Web Model
In this section we use the RL formalism defined in Section 3 to characterize our own model for web hacking. We then discuss how this generic model may be used to implement actual web hacking problems at different levels of abstraction.

The Agent Web Model
In order to define a RL problem, it is necessary to define the state transition function of the problem. In our context, this function represents the logic of the target webserver. Different systems, with different types of vulnerabilities, may be represented in different ways. To simplify the modelling of a webserver, we will represent it as a collection of generic objects. These objects are taken to represent entities of interest (e.g., files, ports) that can be targeted by the actions $A$ of an attacker. This simplification allows us to decompose the design of a target system, its logic and its states. Transition functions can be defined in a modular way with respect to specific objects, and the state of the system may be factored into the states of single objects.
The decomposition of a webserver into a collection of objects also allows us to easily define instances of webservers at different levels of abstraction. By defining the nature and the number of existing objects, and by defining which actions an agent can take in relation to the defined objects, we can immediately control the complexity of the RL problem at hand.
Moreover, a further aim of having a modular system defined in terms of individual objects is the possibility of instantiating new challenges in an automatic, possibly random, way. Such a generative model of web hacking problems would provide the opportunity to easily generate a large number of problems on which to train a RL agent.
We call this flexible, generative model for instantiating different types of web hacking problems the Agent Web Model.
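The idea of generating challenges automatically can be illustrated with a small sketch that builds a random website made of linked files and places a flag in one of them. The function names, the file naming scheme, and the rule that every file is attached to an earlier one (so that everything is reachable from index.html) are assumptions chosen for illustration, not a description of the released implementation.

```python
import random


def generate_level1_site(n_files: int, extra_links: int, seed: int) -> dict:
    """Randomly build a website: a link graph plus the location of the flag."""
    rng = random.Random(seed)
    files = ["index.html"] + [f"file{i}.html" for i in range(1, n_files)]
    links = {name: set() for name in files}
    # Attach every file to an earlier one, so that all files are
    # reachable from index.html by following links.
    for i in range(1, n_files):
        parent = files[rng.randrange(i)]
        links[parent].add(files[i])
    # Add a few extra random links to vary the structure.
    for _ in range(extra_links):
        if n_files < 2:
            break
        src, dst = rng.sample(files, 2)
        links[src].add(dst)
    return {"files": files, "links": links, "flag_file": rng.choice(files)}


def reachable(site: dict) -> set:
    """Files that can be discovered from index.html by following links."""
    seen, stack = {"index.html"}, ["index.html"]
    while stack:
        for nxt in site["links"][stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


site = generate_level1_site(n_files=6, extra_links=3, seed=0)
print(len(reachable(site)) == len(site["files"]))
```

Fixing the seed makes a generated challenge reproducible, which is convenient when the same set of problems has to be shared as a benchmark.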

Levels of abstraction
Concretely, we define 7 different levels of abstraction for web hacking with increasing complexity in terms of the actions and the feedback that the agent can receive (see Figure 1). We start at level1 with the model of a very simple website, composed of basic files, without web parameters and sessions. At higher levels, we allow the agent to interact with more complex objects making up the website: requests to files can accept multiple input parameters with different web methods, as well as multiple session values. A hacking problem at level1 has a trivial solution which could be coded manually in a simple algorithm, but we will show that the computational complexity soon escalates as we move up in the levels. A hacking problem at level7 is close to real-world web hacking, where an attacker can even create its own objects on the target site (e.g., a command script) and carry out complex exploitation strategies; this sort of problem is far from having a trivial solution.
In the following, we discuss the details of the different layers of the Agent Web Model, including the number of states and actions that have to be handled at the different levels. Except when explicitly stated, at all levels of abstraction we will assume that the objects on a webserver are files, and we will take a simple binary reward function $R$ that returns a unitary reward when the agent accomplishes its task, and zero otherwise.

Level1 - Link layer
In level1, a website is composed of a set $O = \{\mathit{file}_1, \mathit{file}_2, \ldots, \mathit{file}_N\}$ of objects representing simple static HTML files. We take the first file to represent the index.html file inside the webroot. Files are linked to each other by pointers, and one of the files contains the flag. All the files can be accessed by the agent without restrictions: no parameters are required, and no sessions are involved. The agent has two parametric actions at its disposal: read($\mathit{file}_i$), which reads the $i$-th file and returns the list of files linked from it, and search($\mathit{file}_i$), which checks the $i$-th file for the presence of the flag. See Table 1 for a summary of the actions, their parameters and their return values. Note that these actions can be performed only on files that the agent has discovered on the remote webserver. Without training a RL agent, a simple heuristic solution to this problem would be to read the files one by one in order to discover all files, and then search for the flag inside each one.
The number of files $N$ that a website hosts has a significant influence on the problem scale. The actual size of the action space $|A|$ depends on the value of $N$: an agent can take up to $2N$ different actions, that is, a read() action and a search() action for each file. Moreover, an agent is required to keep track of its own knowledge state, that is, to record what actions have been executed and what results were observed. A basic agent can simply track, for each file, whether action read() was tried ($2^N$ states) and whether action search() was tried ($2^N$ states). In total, it will have $2^{2N-1}$ states; Table 2 shows an estimate of the number of actions and states as a function of the number of files.
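The level1 setup and the heuristic solution described above can be sketched as follows. The class name `Level1Site`, the four file names, and the way steps are counted are illustrative assumptions; the only elements taken from the text are the two actions read() and search(), the restriction to discovered files, and the bound of $2N$ actions.

```python
class Level1Site:
    """Toy level1 website: static files, explicit links, one flag."""

    def __init__(self, links: dict, flag_file: str):
        self.links = links            # file -> list of linked files
        self.flag_file = flag_file
        self.known = {"index.html"}   # files discovered so far

    def read(self, name: str) -> list:
        """Return the files linked from a known file and mark them known."""
        if name not in self.known:
            raise ValueError(f"{name} has not been discovered yet")
        found = list(self.links[name])
        self.known.update(found)
        return found

    def search(self, name: str) -> bool:
        """Report whether a known file contains the flag."""
        if name not in self.known:
            raise ValueError(f"{name} has not been discovered yet")
        return name == self.flag_file


def heuristic_solver(site: Level1Site) -> tuple:
    """Read every file to discover the site, then search each one."""
    to_read, discovered, steps = ["index.html"], ["index.html"], 0
    while to_read:
        current = to_read.pop(0)
        steps += 1
        for linked in site.read(current):
            if linked not in discovered:
                discovered.append(linked)
                to_read.append(linked)
    for name in discovered:
        steps += 1
        if site.search(name):
            return name, steps
    return None, steps


# Hypothetical four-file website.
LINKS = {
    "index.html": ["about.html", "news.html"],
    "about.html": ["index.html"],
    "news.html": ["hidden.html"],
    "hidden.html": [],
}
site = Level1Site(LINKS, flag_file="hidden.html")
flag_location, steps = heuristic_solver(site)
n_actions = 2 * len(LINKS)   # one read() and one search() per file
print(flag_location, steps, n_actions)
```

In this worst case, where the flag sits in the last file searched, the heuristic uses every one of the $2N$ available actions exactly once.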

Level2 - Hidden link layer
In level2, we again model the website as a collection of static HTML files. Files are still linked by pointers, but we now distinguish two types of pointers: links that are openly visible to the attacker upon reading the files (as in level1), and implicit pointers that require an actual analysis of the file. Real-world examples of these types of implicit pointers may be: comments in the source code that refer to another file without stating a direct link; keywords used in the file that refer to a special type or version of a webserver app or CMS, and that indicate the existence of other default files; recurrent appearance of a word, suggesting that there may be a file or folder with the same name. Practically, level2 problems can be represented as a directed typed graph of files with two types of edges (see Figure 3). The set of actions of the agent is now composed of three parametric actions, $A = \{\mathrm{read}(\mathit{file}_i), \mathrm{search}(\mathit{file}_i), \mathrm{deepread}(\mathit{file}_i)\}$. As before, action read($\mathit{file}_i$) reads the $i$-th file and returns a list of files connected by an explicit link, while search($\mathit{file}_i$) checks the $i$-th file for the presence of the flag. The action deepread($\mathit{file}_i$) processes the $i$-th file and returns a list of files connected by implicit links. See Table 3 for a summary of the actions, their parameters, and their return values. Notice that, at this level of abstraction, the logic and the algorithm for performing a deepread() is implicitly provided in the game itself. At higher levels of abstraction, the task of actually parsing a HTML file and uncovering the possible URLs of new files would be delegated to the learning agent; such an agent would receive the actual content of a file and it could use a range of algorithms to process the text, from simple dictionary mapping (e.g., apache mapping to cgi-bin, wordpress mapping to wp-login, etc.) to more complex natural language processing neural networks able to propose new potential file candidates.
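The simplest of the text-processing options mentioned above, a dictionary mapping from keywords to candidate files, might look like the following sketch. The two dictionary entries are the examples given in the text; the function name `propose_candidates` and the sample page are assumptions made for illustration.

```python
# Keyword dictionary: a keyword seen in a page suggests default files
# or folders that may exist on the same server.
KEYWORD_HINTS = {
    "apache": ["cgi-bin"],
    "wordpress": ["wp-login"],
}


def propose_candidates(page_text: str, hints: dict = KEYWORD_HINTS) -> list:
    """Return candidate names suggested by keywords found in the page."""
    text = page_text.lower()
    candidates = []
    for keyword, names in hints.items():
        if keyword in text:
            for name in names:
                if name not in candidates:
                    candidates.append(name)
    return candidates


# Hypothetical page whose source comment reveals the software in use.
page = "<html><!-- Powered by WordPress, served by Apache --></html>"
print(propose_candidates(page))
```

A learning agent working on the raw page content could use such a table as a baseline before moving to richer language models.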
Given $N$ files on the webserver, the cardinality of the action space is now $|A| = 3N$ and the cardinality of the agent state space is $2^{3N-1}$, by trivially scaling up from level1 because of an additional action. Table 4 shows estimates for a few values of $N$.
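A level2 website can be sketched as a typed graph in which every file carries two lists of outgoing edges. The file names and the breadth-first exploration routine are illustrative assumptions; the three actions and the count $|A| = 3N$ follow the description above.

```python
from collections import deque

# Hypothetical level2 website as a typed graph: each file has explicit
# links (returned by read) and implicit links (returned by deepread).
SITE = {
    "index.html": {"explicit": ["news.html"], "implicit": ["admin.html"]},
    "news.html": {"explicit": ["index.html"], "implicit": []},
    "admin.html": {"explicit": [], "implicit": ["backup.html"]},
    "backup.html": {"explicit": [], "implicit": []},
}
FLAG_FILE = "backup.html"


def read(name: str) -> list:
    """Files connected to the given file by an explicit link."""
    return SITE[name]["explicit"]


def deepread(name: str) -> list:
    """Files connected to the given file by an implicit link."""
    return SITE[name]["implicit"]


def search(name: str) -> bool:
    """Whether the given file contains the flag."""
    return name == FLAG_FILE


def explore(use_deepread: bool) -> list:
    """Breadth-first discovery starting from index.html."""
    discovered, queue = ["index.html"], deque(["index.html"])
    while queue:
        current = queue.popleft()
        neighbours = list(read(current))
        if use_deepread:
            neighbours += deepread(current)
        for name in neighbours:
            if name not in discovered:
                discovered.append(name)
                queue.append(name)
    return discovered


explicit_only = explore(use_deepread=False)
with_implicit = explore(use_deepread=True)
print(any(search(f) for f in explicit_only))   # explicit links only
print(any(search(f) for f in with_implicit))   # explicit and implicit links
print(3 * len(SITE))                           # size of the action space, 3N
```

In this toy instance the flag can be reached only through implicit links, so an agent that never uses deepread() cannot solve the challenge.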

Level3 - Dynamic content layer
The real complexity of a website starts with server-side scripting. In level3 we consider a webserver that can dynamically execute server-side scripts by processing user parameters and generating static content for the client. A single web file can provide multiple results based on the parameters that the site receives from the client. We still model the webserver as a collection of static files, delegating the complexity of dynamic server-side scripting to the space of actions. From a practical perspective, the webserver can still be seen as a directed typed graph with nodes that may return different values depending on the received parameter (see Figure 4).
Fig. 4: Example of webserver at level3. Solid nodes represent files, dotted nodes within a file illustrate a pair of parameter name and value that may be sent to a file, and solid arrows and dashed arrows represent respectively direct and indirect connections between files given a parameter pair. If an arrow leads to a file, it means that upon a successful read() or deepread() action the file itself is revealed without parameters; if an arrow leads to an internal dotted node, then after a successful read() or deepread(), a file together with a parameter list for the file is also sent back to the agent.
In order to account for parameter passing, we now define a new set of parametric actions: $A = \{\mathrm{read}(\mathit{file}_i, \mathit{pname}_j, \mathit{pval}_k), \mathrm{search}(\mathit{file}_i, \mathit{pname}_j, \mathit{pval}_k), \mathrm{deepread}(\mathit{file}_i, \mathit{pname}_j, \mathit{pval}_k)\}$. Actions have the same semantics as in level2, but now, beyond receiving $\mathit{file}_i$ as an input parameter, they also receive parameter name $j$ and parameter value $k$. This reflects the request of a specific URL ($\mathit{file}_i$) together with a specific parameter ($\mathit{pname}_j$) and a set value ($\mathit{pval}_k$). The return value of the read() and deepread() actions is also enriched by a possible set of parameter names and values; this is due to the fact that the answer of the webserver may contain not only links to other files, but it may include the specific parameter pairs relevant to the connected files. See Table 5 for a summary of the actions, their parameters, and their return values. Notice that at this level of abstraction, we assume that only a single pair ($\mathit{pname}_j$, $\mathit{pval}_k$) can be specified as input; moreover, to keep the complexity in check, we assume that $\mathit{pname}_j$ and $\mathit{pval}_k$ may assume values in a finite set, that is $1 \leq j \leq M$ and $1 \leq k \leq O$, with $M, O \in \mathbb{N}_{\geq 0}$. Table 6 shows some estimates for different values of $N$, $M$, and $O$.
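The effect of a parameter pair on the answer of a file can be sketched with a toy level3 website. The file names, the parameter name `page`, and the data layout are hypothetical; the enumeration at the end assumes that each request carries either no parameter pair or exactly one, which is an assumption about how the action space is counted rather than a figure taken from the tables.

```python
from itertools import product

# Hypothetical level3 website: the same file answers differently
# depending on the single (parameter name, parameter value) pair sent.
# The key (None, None) stands for a request without parameters.
SITE = {
    "index.php": {
        (None, None): {"links": [("view.php", "page", "1")], "flag": False},
    },
    "view.php": {
        (None, None): {"links": [], "flag": False},
        ("page", "1"): {"links": [("view.php", "page", "2")], "flag": False},
        ("page", "2"): {"links": [], "flag": True},
    },
}
EMPTY = {"links": [], "flag": False}


def read(file, pname=None, pval=None):
    """Return the linked (file, pname, pval) triples for this request."""
    return SITE[file].get((pname, pval), EMPTY)["links"]


def search(file, pname=None, pval=None):
    """Check whether the answer to this request contains the flag."""
    return SITE[file].get((pname, pval), EMPTY)["flag"]


def enumerate_actions(files, pnames, pvals,
                      verbs=("read", "search", "deepread")):
    """All actions with zero or one parameter pair (an assumption)."""
    pairs = [(None, None)] + list(product(pnames, pvals))
    return [(v, f, n, x) for v in verbs for f in files for n, x in pairs]


target = read("index.php")[0]   # follow the first link from the index
target = read(*target)[0]       # follow the link that this answer reveals
print(target, search(*target))
actions = enumerate_actions(list(SITE), ["page"], ["1", "2"])
print(len(actions))
```

Under the stated assumption the enumeration contains $3N(1 + MO)$ actions, which already shows how quickly parameters inflate the action space compared with level2.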

Level4 - Web method layer
In level4 we scale the complexity in an effort to make the problem more realistic. We now consider the possibility of a webserver receiving a request specifying a web method and containing a list of parameter names and an associated list of parameter values. This better captures the actual dynamics of the HTTP protocol, reflecting the syntax of common HTTP methods such as GET and POST. The webserver is again modeled as a collection of files forming a directed typed graph with nested nodes (see Figure 5). The set of parametric actions is now restructured. We drop the previous artificial distinction between read(), search(), and deepread(), and we define instead the actions get() and post(), each of which receives a file together with a list of parameter names and a list of parameter values. See Table 7 for a summary of the actions, their parameters, and their return values. Given, as before, $N$ files on the webserver, $M$ possible alternatives for the parameter names, and $O$ possible alternatives for the parameter values, the cardinality $|A|$ depends on the maximum length $P$ of the list of parameters. With $P = 0$, $|A| = 2N$, that is, trivially, get() and post() actions with no parameter on each file. With $P = 1$, $|A| = 2N + 2NMO$, that is, the same two actions for every possible combination of zero or one parameter name and value (similar to level3). In the worst case in which $P = M$, that is, the list can be long enough to contain all the parameter names, the number of possible actions could be estimated as:
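As a hedged illustration, and not necessarily the exact expression intended here, one estimate that agrees with the cases $P = 0$ and $P = 1$ is $|A| = 2N \sum_{i=0}^{P} \binom{M}{i} O^i$, which for $P = M$ equals $2N(1 + O)^M$. It rests on two assumptions: each parameter name appears at most once in a request, and the order of the parameters is irrelevant. The sketch below compares this closed form against a brute-force enumeration; all function names, file names and parameter names are hypothetical.

```python
from itertools import combinations, product
from math import comb


def count_actions(n_files: int, m_names: int, o_values: int, p_max: int) -> int:
    """Closed-form count: 2 methods, N files, up to p_max distinct names."""
    per_file = sum(comb(m_names, i) * o_values ** i for i in range(p_max + 1))
    return 2 * n_files * per_file


def enumerate_actions(files, names, values, p_max):
    """Brute-force listing of the same action set, for cross-checking."""
    actions = []
    for method, file in product(("get", "post"), files):
        for size in range(p_max + 1):
            # Choose which parameter names are sent, then their values.
            for chosen in combinations(names, size):
                for assigned in product(values, repeat=size):
                    actions.append((method, file, chosen, assigned))
    return actions


files, names, values = ["a.php", "b.php"], ["id", "page", "q"], ["0", "1"]
for p_max in range(len(names) + 1):
    closed = count_actions(len(files), len(names), len(values), p_max)
    listed = len(enumerate_actions(files, names, values, p_max))
    print(p_max, closed, listed)
```

Even in this tiny instance the count grows from 4 actions without parameters to 108 when all three parameter names may be sent together.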

Level5 -HTTP header layer
While all the previous layers considered only the URL and the body part of the HTTP packets, level5 takes the HTTP header into consideration as well. The HTTP header can contain relevant information such as the session variables or the web response code in the response header. The session, which is composed of a session variable name and value (e.g., JSESSIONID=Abvhj67), is used to provide elevated access to special users; a practical example is the login process (which may happen by sending multiple POST parameters, as modeled in level4), after which the server sets a new session value. Additional HTTP header information, such as the browser type or the character encoding, can also have an effect on the response provided by the webserver.
We always model the webserver as a collection of files forming a directed typed graph with nested objects (see Figure 6). Object access is now more complex, as it also depends on all the header variables. This complexity may in part be reduced by considering a pair of session name and session value as a single parameter (session values are usually random numbers with high entropy, so there is no point in handling the session variable name and value separately unless the session values are predictable and the attacker wants to brute-force the session value), and by limiting the number of allowed session pairs and HTTP headers. Under this simplification, we preserve the same actions as level4, but we extend the signature of their input parameters: beyond the file and the list of parameters, the actions now also receive session pairs and an HTTP header (header). The result of these actions is a web response, possibly together with an HTTP page. The web response code (e.g., 200, 404, 500) reflects the accessibility of the requested object. As before, the flag is considered retrieved when the agent obtains the HTTP page containing the flag. See Table 8 for a summary of the actions, their parameters, and their return values.
With reference to the actions we have defined, we observe an enlargement of the action space, which now depends on the number N of files on the server, the number M of parameter names that can be selected, the number O of parameter values available, the number P of parameter pairs that can be sent, the number Q of session pair values available, the number R of session pairs that can be sent, and the number S of HTTP headers (without cookies) that can be sent. Figure 6 also illustrates a possible interaction between the agent and the webserver. The attacker first tries to log in using an invalid password, which reveals a new version of the login.php file that redirects to the index.php page without a session. Using the right credentials shows another version of the login.php page that instead redirects the user to a version of index.php with the session pair sessionpair1. This version of index.php then leads to another version of the file (logout action) that is connected to the original version of index.php without a session.
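The login interaction described above can be mimicked with a small simulated server. The sketch below is only an illustration under stated assumptions: the credentials, the parameter names (user, pass, action), the page labels, and the exact signature of get() and post() are hypothetical, and the header argument is accepted but ignored to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional

VALID_LOGIN = (("user", "admin"), ("pass", "secret"))  # hypothetical credentials


@dataclass(frozen=True)
class Response:
    code: int                      # web response code, e.g. 200 or 404
    page: Optional[str] = None     # which version of a page was returned
    session: Optional[str] = None  # session pair set by the server, if any


def post(file, params=(), session=None, header=None):
    """Simulated POST: only login.php reacts to parameters in this toy server."""
    if file == "login.php":
        if tuple(params) == VALID_LOGIN:
            return Response(200, "index.php (logged in)", "sessionpair1")
        return Response(200, "index.php (no session)")
    return Response(404)


def get(file, params=(), session=None, header=None):
    """Simulated GET: the version of index.php depends on the session pair."""
    if file == "index.php":
        if session == "sessionpair1" and ("action", "logout") not in tuple(params):
            return Response(200, "index.php (logged in)", session)
        return Response(200, "index.php (no session)")
    return Response(404)


bad = post("login.php", (("user", "admin"), ("pass", "guess")))
good = post("login.php", VALID_LOGIN)
inside = get("index.php", session=good.session)
out = get("index.php", (("action", "logout"),), session=good.session)
print(bad.page, "|", good.session, "|", inside.page, "|", out.page)
```

The same file name thus maps to several versions of a page, selected by the parameter list and the session pair, which is the nesting that the graph of Figure 6 represents.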

Level6 -Server structure layer
In a complex web hacking scenario, the attacker may map the file system of the server in order to collect information to be used during the attack. In level6 we extend the formalization of the webserver in order to consider not only files within the webroot, but also objects beyond it, such as local files and databases. This extension allows the simulation of attacks relying on local file inclusion (LFI) vulnerabilities, or of information gathering attacks on a database in order to set up a SQL injection. Figure 7 shows the structure of a webserver, and illustrates a possible LFI attack to obtain the webserver logs or the environment variables. While the action set remains the same as in level5, the extension of the domain of the objects beyond the webroot escalates the number of targets that the agent may consider. Complexity soars with the increase of objects, including databases, and, within a database, its tables, columns and rows.
Compared to the lower levels of abstraction, level6 provides the agent with the following additional features:
- Obtaining the local resources of the website, such as the background files, if there are any, or the background database records used for the website operation. The attacker can use these data for the attack with the normal requests covered in lower layers;
- Accessing the data in order to compromise other websites residing on the same webserver;
- Obtaining the webserver files that are used for purposes other than the website operations, such as user data, other service data, and operating system data, and using these for the attack.

Fig. 7: Example of webserver at level6. Solid nodes represent files, dotted nodes within a file illustrate possible lists of parameter name and value pairs and session name and value pairs that may be sent to a file via a webmethod, and solid arrows represent connections between files given parameters and sessions. Dotted boundary lines separate different logical spaces, such as the webserver space and the database space. Dashed arrows mark connections between these logical spaces.
In this scenario the access rights of the objects play an important role; running a webserver as root can have serious consequences, while granting only minimum access rights reduces the chance of such exploitations. Notice, though, that in practice, from the point of view of the agent, there is no difference between the case in which an object is not present and the case in which the object is present but the website has no read access to it.
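The last remark can be illustrated with a minimal sketch in which the simulated server returns the same observation for an absent object and for an object the website cannot read. The function name, the paths, and the use of None as the "nothing returned" observation are all assumptions made for the illustration.

```python
from typing import Dict, Optional, Set


def observe(objects: Dict[str, str], readable: Set[str], path: str) -> Optional[str]:
    """Return the object's content, or None when the website cannot serve it."""
    if path in objects and path in readable:
        return objects[path]
    # Same observation for "absent" and for "present but not readable".
    return None


objects = {"/var/www/index.php": "<html>...</html>", "/root/notes.txt": "private"}
readable = {"/var/www/index.php"}  # rights of the (non-root) webserver user

print(observe(objects, readable, "/var/www/index.php") is not None)  # True
print(observe(objects, readable, "/root/notes.txt"))                 # None
print(observe(objects, readable, "/does/not/exist"))                 # None
```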

Level7 -Server modification layer
The last layer we consider in our Agent Web Model is the server modification layer. In this layer we assume that the agent can carry out complex meaningful web hacking actions such as creating its own objects, either inside or outside the web root. With the ability to create its own files, the attacker can place command scripts that can be used to carry out advanced attacks. Figure 8 shows the same structure of the server as in level6, and it illustrates an attacker creating its own files on the webserver.

Fig. 8: Example of webserver at level7. Solid nodes represent files, dotted nodes within a file illustrate possible lists of parameter name and value pairs and session name and value pairs that may be sent to a file via a webmethod, and solid arrows represent connections between files given parameters and sessions. Dotted boundary lines separate different logical spaces, such as the webserver space and the database space. Dashed arrows mark connections between these logical spaces. Boldface objects represent objects created by the attacker.
Attacking actions leading to the creation of objects can be carried out by the web requests that we have already considered. The action does not change, but the domain of the parameters increases in order to allow for more sophisticated actions.
Compared to the lower levels of abstraction, level7 provides the agent with the following additional features:
- Causing denial of service by editing objects that are important for the site operation;
- Defacing the site by changing the site content;
- Escalating privileges by adding data to objects;
- Uploading attack scripts to provide extra functions for the attack;
- Removing attack clues by deleting log files and temporary files that were used for the attack.
Level7 is assumed to be the highest level of modelling, capturing all relevant features of hacking; thus, solving this challenge is extremely hard, and we would expect that a successful agent would perform as well as, or better than, a professional human hacker actually involved in a process of website hacking.

Modelling web vulnerabilities
In this section we analyze how different types of web vulnerabilities fit within our Agent Web Model. Refer to Table 9 for a summary of which vulnerabilities may be modeled at which level.
Information disclosure is a type of vulnerability where the attacker gains useful information by penetrating the system. Evaluating the usefulness of the gained information is not trivial, but through the CTF formalization we make the simplifying assumption that the relevant information the attacker may be interested in is marked by the flag. In this way, it is possible to equate successful information disclosure with the retrieval of the flag. Every level of abstraction in our Agent Web Model captures this attack: in level1 sensitive information (flag) is in a public linked file on the webserver; in level2 sensitive information (flag) can be inside a private file; in the following layers (level3 to level5) sensitive information (flag) can be accessed using special parameters or sessions; in level6, sensitive information (flag) can be inside a file outside the webroot.
Web parameter tampering [16] is a type of attack where the web parameters exchanged by the client and the server are modified in order to have access to additional objects. Our Agent Web Model captures this attack starting at level3 by allowing the specification of web parameters in the URL; in level4 it is possible to add HTTP body parameters (POST message); in level5 it is possible to edit cookies in the HTTP header. In all these instances, an agent can perform web parameter tampering either by meaningfully exploring the space of possible values of these parameters, or by trying to brute-force them.
Cross Site Scripting (XSS) attacks [13] enable attackers to inject client-side code (e.g., JavaScript) into the webpage viewed by other users. By exploiting an XSS vulnerability the attacker can overwrite the page content on the client side, redirect the page to the attacker's page, or steal the valid sessions inside the cookie. All these offensive actions can be followed by some social engineering trick in case of a real attack. In the context of CTF-style challenges where additional clients are not available, the aim of an attacker is simply to show the existence of the vulnerability. A flag may be used to denote a page that is only accessible indirectly by redirection. The task for the agent is to find the right parameters to achieve the redirection. The injected client-side code for XSS has to be sent through web parameters. XSS attacks can be simulated in our Agent Web Model as soon as we can interact with parameters: in level3 the attacker may add code in the URL; in level4 the attacker may modify POST parameters; in level5 the XSS attack may affect the header.
Cross Site Request Forgery (CSRF) [29] is a type of vulnerability where the attacker sends a link to authenticated users in order to trick them, by social engineering, into executing web requests. If the users are authenticated (have sessions), the malicious request (e.g., transferring money, changing the state) is executed by the server. This exploitation is based on social engineering and on misleading the user. In addition, CSRF tokens are sent by the server to filter out unintended requests; the agent can check the existence of appropriate CSRF tokens or exploit requests with weak CSRF tokens. In our model the CSRF attack has to be simplified to consider only the CSRF token manipulation in level5.
SQL injection [1] is a vulnerability where malicious SQL statements can be executed by the server due to the lack of input validation on the server side. By modifying the original SQL statement of a server-side script the attackers can bypass authentication, access confidential database information or even write attack scripts on the server (SELECT ... INTO OUTFILE command). In most cases the attacker has to map the database structure of the target by finding, for instance, the different table names along with their column names and types. In our Agent Web Model this attack can be completely simulated at level6, although other simplified versions may happen at lower levels. In the easiest case the agent only needs one dynamic parameter without sessions; bypassing only a simple authentication or collecting data from the same table that the server-side script uses does not require knowing the table name and other database structure data; in these cases, a basic form of SQL injection may be simulated even in level3. Complex cases comprising all the database parameters need to happen at level6. If the attacker uses the SQL injection to carry out further actions such as writing attacking scripts on the compromised site, then this has to happen at level7. All the above-mentioned cases require a very high number of actions, especially when the agent has to execute a Boolean-based blind SQL injection. In these cases, the vulnerable application provides only true or false answers, so obtaining one single piece of information, such as a column name in a table, requires binary search type requests for each character, which can lead to an exponential number of actions. Notice that the Agent Web Model abstraction does not consider the response time of the environment. In very specific cases such as time-based blind SQL injections, the attacker may have to measure the response time; this type of exploitation would require the consideration of the server reaction time too.
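The "binary search type requests for each character" can be illustrated against a purely simulated true/false oracle. The sketch below makes several assumptions that are not in the text: the candidate alphabet, the form of the yes/no question, and the fact that the string length is already known. It only counts the true/false queries issued once these are fixed; it says nothing about the number of candidate actions among which the agent has to choose at each step.

```python
import string

# Assumed candidate characters, sorted so that binary search is valid.
ALPHABET = sorted(string.ascii_lowercase + string.digits + "_")


def make_oracle(secret):
    """Simulated application that answers only True/False and counts requests."""
    calls = {"n": 0}

    def oracle(position, threshold):
        calls["n"] += 1
        # "Is the character at `position` greater than `threshold`?"
        return secret[position] > threshold

    return oracle, calls


def recover(oracle, length):
    """Recover a string of known length with one binary search per character."""
    out = []
    for pos in range(length):
        lo, hi = 0, len(ALPHABET) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if oracle(pos, ALPHABET[mid]):
                lo = mid + 1
            else:
                hi = mid
        out.append(ALPHABET[lo])
    return "".join(out)


oracle, calls = make_oracle("user_name")
recovered = recover(oracle, len("user_name"))
print(recovered, calls["n"])
```

With the 37-character alphabet assumed here, each character costs at most six true/false requests, so a single column name already requires dozens of actions.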
XPath injection [4] is a web vulnerability where the attacker injects code into the web request, but the target of the attack is not a database (as in the case of SQL injection) but an XML file. By exploiting XPath injection the attacker can iterate through XML elements and obtain the properties of the nodes one by one. This operation requires only one parameter, so simulating XPath injection is theoretically possible in level3. Since the exploitation of the XPath injection does not require the name of the XML file, mapping the files outside the webroot is not necessary even if the XML file is outside the webroot. On the other hand, the vulnerable parameter can be a POST parameter (level4) or it can require a specific session (level5).
Server-Side Template Injection (SSTI) [18] is a vulnerability where the attacker uses native template syntax to inject a malicious payload into a website template. For the exploitation the agent has to use additional actions that are SSTI specific, such as sending a string like ${7*7} together with a parameter. Theoretically, an easy SSTI vulnerability can be exploited in level3, but all other layers above can be used to represent specific attack cases (vulnerable parameter in POST on level4, session required for exploitation on level5); in particular cases, the attacker can list the server structure (level6) or can create files with arbitrary code execution (level7).
File inclusion [17] makes the attacker capable of including remote or local files by exploiting a vulnerable web parameter on the website. In case of remote file inclusion (RFI), the attacker can include its own remote attacking script in the server-side script. Remote file inclusion can have very serious consequences, but in a CTF challenge the aim is just to show the possibility of the exploitation, not to carry out an actual exploit. RFI can be realized by providing a remote file that sends the flag if the request is initiated from the target website IP. Exploiting RFI is possible in level3, but other parameters, such as POST requests and sessions, can be relevant (level4 and level5). As a consequence of the RFI vulnerability the attacker can create files on the website for further attacks. In case of local file inclusion (LFI), the attacker can include local files in the server-side script. For the exploitation one single parameter is theoretically enough, but since it is usually necessary to read local files outside the webroot, the agent has to map at least a part of the server structure (level6). In some exploitation scenarios the attacker can use local files (such as logs or files in the /proc Linux folder) to create its own files on the server (level7).
Session-related attacks [36] exploit session disclosure or other weaknesses in the session generation process. Since we model the environment as the server itself without other network nodes, man-in-the-middle session disclosures cannot be considered. Other session disclosures can be possible, for instance, if the sessions are stored in the logs and the website can access the log files (LFI), as modeled in level6. Brute-forcing the session is also possible in level5, but brute-force actions dramatically increase the complexity and the number of possible actions.
HTTP response splitting [15] is a vulnerability where the attacker can control the content of the HTTP header of a web request. The ability of the attacker to construct arbitrary HTTP responses can result in many other exploits such as cache poisoning or Cross Site Scripting. Our Agent Web Model considers the HTTP header information in level5, but only with limited information (different session pairs and the whole header together with different versions). Training the agent to learn HTTP response splitting exploitation would require splitting the HTTP header into multiple parts and allowing the agent to consider actions on different HTTP header combinations.

An implementation of the first three levels of the Agent Web Model has been developed in agreement with the standard defined in the OpenAI gym framework [6], and it has been made available online (see footnote 1). Each level provides a simple interface to an abstraction of a CTF challenge; a webserver makes available to the agent a finite set of actions, and it returns information about the state of the game upon the choice of an action. Levels may be instantiated parametrically (deciding the number of files, the links, and the possible parameters), thus offering the possibility of generating a wide variety of challenges for a learning agent. The first three levels already offer a wide degree of challenge: while level1 provides a simple, tutorial-like CTF game, the third level constitutes a non-trivial abstraction of a real hacking challenge. By adopting the standardized OpenAI gym interface, we hope to make it easy for researchers and practitioners to test their agents and algorithms on our challenges. In particular, we hope to simplify the process of deploying and training off-the-shelf RL agents, as well as to provide interesting problems that may promote the development of new learning algorithms.
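To give a flavour of what such an interface looks like, the following self-contained sketch mimics the reset()/step() convention of OpenAI gym for a level1-like game. It is not the released implementation and does not import gym: the class name, the observation encoding, and the reward values are all hypothetical choices made for the illustration.

```python
import random


class ToyLevel1Env:
    """Hypothetical level1-style CTF game with a gym-like reset()/step() API."""

    def __init__(self, n_files=5, seed=0):
        self.n_files = n_files
        self.rng = random.Random(seed)  # seeded so that episodes are reproducible
        self.flag_file = None
        self.visited = None

    def reset(self):
        self.flag_file = self.rng.randrange(self.n_files)  # file holding the flag
        self.visited = [False] * self.n_files
        return tuple(self.visited)  # observation: which files have been read

    def step(self, action):
        """`action` is the index of the file to read."""
        self.visited[action] = True
        done = action == self.flag_file
        reward = 100 if done else -1  # assumed reward scheme
        return tuple(self.visited), reward, done, {}


env = ToyLevel1Env(n_files=5, seed=0)
obs = env.reset()
total, done = 0, False
for action in range(env.n_files):  # naive agent: read the files in order
    obs, reward, done, _ = env.step(action)
    total += reward
    if done:
        break
print(done, total)
```

A learning agent would replace the fixed loop with a policy that selects the next action from the observation, which is the role that off-the-shelf RL algorithms play when plugged into a gym-compatible environment.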

Ethical considerations
RL agents trained for ethical penetration testing carry with them the potential for malicious misuse. In particular, the same agents may be deployed and adapted with the aim of generating material or immaterial damage. We would like to repeat that the aim of the current study is to develop agents to assist ethical hackers in legitimate penetration testing, and to develop an understanding of RL agents on a preventive ground only. For this reason, we advocate the development of agents in the context of CTF challenges, where the aim is a minimal and harmless exploitation of a vulnerability as a proof-of-concept (capture of the flag), but no further attacks are considered. We distance ourselves from, and condemn, any application of these results for the development of offensive tools, especially in a military context (see footnote 2).

Conclusions
In this paper we presented a model, named Agent Web Model, that defines web hacking at different levels of abstraction. This formulation allows for a straightforward implementation of problems suited for machine learning agents. Since the aim and type of web attacks can be various, and different technical and human methods may be involved, we first restricted our attention to CTF-style hacking problems. We then modeled CTF-style web hacking as a game and as an RL problem. The RL problem considers a single player dealing with a static website consisting of objects with which the agent can interact by sending requests (with or without parameters). We formalized RL problems on 7 different levels of abstraction, ordered by increasing complexity in terms of number of objects, actions, parameters and states. Starting from a simple challenge on the first level of abstraction, we observed the complexity of the problems quickly increasing, thus defining a non-trivial learning challenge for an artificial agent. An implementation of the problems on the first levels of abstraction was provided. The challenges we implemented range in complexity, allow for customizability, and provide a way to instantiate a large number of random web hacking challenges in a generative way in order to train an artificial agent. Future work will be directed to further developing and standardizing CTF challenges at higher levels of abstraction, as well as applying state-of-the-art RL techniques to the problems we defined.
It is our hope that the formalization presented in this paper may not only allow for the development of automatic red bots that may help in the task of ethical penetration testing, but also promote interaction and research in both fields of machine learning and computer security: helping security experts to define realistic and relevant challenges that meet the formalism of machine learning, and offering RL experts stimulating problems that may foster advances in machine learning.