
1 Introduction

An API is a set of definitions and protocols that serves as a channel for data exchange between programs. Modern applications are often built around many well-defined interfaces to improve scalability and compatibility. Although the widespread use of APIs has made data access far more convenient, allowing different clients to retrieve relevant information in a similar way, it has also introduced security issues that cannot be ignored. In modern microservice architectures in particular, each application is subdivided as finely as possible, making the security risks faced by APIs harder to detect completely. Effective detection of API security risks is therefore essential to keep a system running in good condition.

APIs are accessed over HTTP/HTTPS, so threats against the Web protocol extend to APIs as well, including SQL Injection, Broken Authentication and Session Management, and Cross-Site Scripting (XSS). Notably, these attacks are typically carried out by tampering with the parameters of an API. Therefore, a key idea for mitigating API threats is to prevent parameters from being tampered with. The security community has proposed a variety of approaches to address parameter tampering, the most common of which is rule-based detection. Such methods are usually implemented as a lightweight agent that inspects a Web request for potential security risks before the server processes it. If a rule is matched and the request is identified as a security risk, the request is filtered out so that the server is not affected. Although this detection method is simple to implement and efficient, its reliance on manually preset rules means it cannot detect previously unseen attacks; as a result, it performs poorly in practice and suffers from a high false-negative rate.

Deep learning models have achieved remarkable success in various natural language processing (NLP) tasks, and it has been shown that they can effectively learn the underlying data distribution, which rule-based detection methods cannot do. By learning the data distribution, parameter tampering detection can be cast as a pattern classification problem whose goal is to distinguish the feature patterns of normal API accesses from those of malicious ones. However, deep learning methods typically require large amounts of training data, which may be difficult to obtain. Furthermore, normal accesses are far more common than malicious ones, so the ratio between them may be highly imbalanced. Imbalanced data also makes it difficult for a model to learn the data distribution, as the model may focus on the majority class and ignore the minority class.

In this paper, to detect parameter tampering attacks against APIs and reduce the influence of imbalanced data, we propose the Context-based Malicious Parameter Detection (CMPD) framework. CMPD improves the effectiveness of malicious parameter detection by learning the distribution of each component of the API and modeling the relationships among URLs, parameter names, and parameters. Experiments show that CMPD outperforms all baselines on the CSIC 2010 dataset, with the \(F_1\) value reaching 0.97. CMPD also achieves an \(F_1\) value of 0.91 on an imbalanced dataset in which normal accesses are 100 times more numerous than malicious ones, and an \(F_1\) value of 0.89 when the training data of the CSIC 2010 dataset is reduced to 20%.

We summarize our main contributions as follows:

  • We propose a semantic extraction and learning module to learn the relationships among URLs, parameter names, and parameters, which universally models the parameter distribution of different APIs in one framework.

  • We propose the Context-based Malicious Parameter Detection (CMPD) framework, which can effectively detect parameter tampering attacks against APIs based on the context information indicated by the distribution of the parameters.

  • Experiments on the CSIC 2010 dataset show that CMPD outperforms other baselines, including rule-based methods, Support Vector Machine (SVM), and Autoencoder, and achieves competitive results on unbalanced data and reduced data.

2 Related Work

2.1 Vulnerability Detection for APIs

To detect API vulnerabilities, current methods mainly focus on black-box testing, which requires generating a large number of test cases. Much research relies on crawlers or manual methods to obtain the detection object, parses out the fuzz domain based on the detection object to generate test cases, and uses an attack pattern library to perform vulnerability detection [1, 3, 4]. Various methods have been proposed to generate test cases. Atlidakis et al. [2] propose REST-ler to automatically generate test requests with a random walk algorithm. Avinash et al. [12] propose six attack patterns for replay attacks to automatically generate test cases. Douibi et al. [5] automatically generate test cases for REST APIs based on Swagger and OpenAPI descriptions. Because crawler-based API vulnerability detection suffers from low coverage and manual testing cannot be carried out at scale, black-box testing is often combined with interface documentation. Yu et al. [16] propose a fuzzing system for RESTful APIs based on SwaggerHub’s development interface and improve the effectiveness of fuzz testing through automatic test-case generation and filtering. Viglianisi et al. [14] generate normal and malicious test cases based on the interface documentation to test the security of RESTful APIs. Various tools have also been proposed to automatically scan for API vulnerabilities, such as FuzzAPI, APIFuzzer, boofuzz, and Astra. These tools do not require source code or interface documents; instead, they combine manual and crawler-based methods to achieve vulnerability detection. When interface documents are available, tools such as TNT-Fuzzer, 42Crunch, and OWASP ZAP can directly extract detection objects from the documents to achieve vulnerability detection with high coverage.

2.2 Parameter Tampering Detection for APIs

Both rule-based and learning-based methods have been proposed to detect parameter tampering attacks against APIs. ModSecurity develops the OWASP ModSecurity Core Rule Set (CRS), which contains a large number of rules for detecting SQL Injection, Cross-Site Scripting, and HTTP protocol violations. Rieck et al. [11] use n-grams and a similarity measure to generate new features for anomaly detection. Ingham et al. [6] propose a Deterministic Finite Automaton (DFA) induction method that uses a heuristic algorithm to detect abnormalities. Ma et al. [8] use machine learning methods, including Naive Bayes, Support Vector Machine, and Logistic Regression, to learn the distribution of static features and detect attacks. Nguyen et al. [10] use a feature selection algorithm to reduce the dimensionality of features extracted from traffic, thereby reducing the computational complexity of the learning algorithm. Liang et al. [7] develop an RNN-MLP network to detect malicious accesses, where the RNN contains LSTM and GRU cells and is followed by an MLP. Wang et al. [15] investigate CNN and LSTM models and their combination for malicious access detection, outperforming the traditional methods.

3 Methodology

3.1 Parameter Tampering Attacks Against APIs

API parameter tampering attacks attempt to manipulate the parameters transmitted between client and server in order to alter application data, such as user passwords and permissions or product prices and quantities. Such data is typically kept in cookies, hidden form fields, or URL query strings and is used to regulate and enhance the functionality of the application. The attack’s success depends on flaws in integrity and logical validation mechanisms, and exploiting these flaws may lead to further consequences such as cross-site scripting (XSS) and SQL injection. Parameter tampering is usually confined to several essential categories of data: API query parameters, cookies, form fields, and HTTP headers. Formally, an API consists of a base URL u, a group of parameter names \(\{n_i|i\in N\}\), and a group of parameters \(\{p_i|i\in N\}\), where the i-th parameter is associated with the i-th parameter name. Suppose the server expects to receive a benign query; for the target u, denote all possible benign choices of the i-th parameter by \(\mathcal {P}_i\) and all possible benign choices of the i-th parameter name by \(\mathcal {N}_i\). A benign API query for the target URL u can then be defined as

$$\begin{aligned} \forall \ i \in N, \ p_i \in \mathcal {P}_i,\quad \textit{and}\quad \forall \ i \in N, \ n_i \in \mathcal {N}_i \end{aligned}$$
(1)

A parameter tampering attack against the target URL u, i.e., a violation of Eq. (1), can thus be defined as

$$\begin{aligned} \exists \ i \in N, \ p_i \notin \mathcal {P}_i,\quad \textit{or}\quad \exists \ i \in N, \ n_i \notin \mathcal {N}_i \end{aligned}$$
(2)

Intuitively, a parameter tampering attack may occur under the following conditions:

  • The adversary tampers with the parameters of an API, attempting to make the server process the tampered parameters for malicious purposes such as SQL injection and XSS attacks.

  • The adversary tampers with the parameter names of an API, attempting to make the server process the tampered parameter names to achieve malicious purposes such as bypassing verification.

  • The adversary tampers with both the parameters and the parameter names of an API. Even if the tampered parameters and parameter names are benign values for another API, they are malicious values for the current API.
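To make the definition concrete, the following minimal Python sketch checks a query against Eq. (1) and Eq. (2). It is not part of CMPD itself; the allowed sets, URL, and parameter values are purely illustrative.

```python
# Hypothetical check of Eq. (1)/(2): a query is benign only if every
# parameter name and parameter value falls in the allowed sets for the URL.

# Allowed parameter names and values per URL (illustrative values only).
ALLOWED = {
    "/tienda1/publico/entrar.jsp": {
        "errorMsg": {"Credenciales+incorrectas"},
        "mode": {"login", "logout"},
    }
}

def is_tampered(url: str, query: dict) -> bool:
    """Return True if the query violates Eq. (1), i.e. satisfies Eq. (2)."""
    allowed = ALLOWED.get(url)
    if allowed is None:
        return True                      # unknown URL: treat as suspicious
    for name, value in query.items():
        if name not in allowed:          # tampered parameter name
            return True
        if value not in allowed[name]:   # tampered parameter value
            return True
    return False

print(is_tampered("/tienda1/publico/entrar.jsp",
                  {"errorMsg": "Credenciales+incorrectas"}))     # False: benign
print(is_tampered("/tienda1/publico/entrar.jsp",
                  {"errorMsg": "Credenciales+incorrectas%11"}))  # True: Eq. (2)
```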

3.2 Semantic Extraction and Learning

Since parameter tampering attacks have a wide attack surface and a large scope of tampering, traditional methods cannot judge within a single model whether a request has been tampered with. Furthermore, because text information is discrete, traditional methods cannot exploit the semantic information it contains. Therefore, we use the Semantic Extraction and Learning Module to learn the distributional relationships among the base URL, the parameter names, and the parameters of an API, mapping discrete text information into a high-dimensional continuous space. The general framework of the semantic extraction and learning module is shown in Fig. 1.

Fig. 1. Illustration of the semantic extraction and learning module of CMPD.

Specifically, suppose there is a neural network with K layers; the weight matrices and bias vectors of each layer can be defined as

$$\begin{aligned} \begin{aligned} \textbf{W}^{(1)}&\in R^{m_{1} \times m_{0}} \quad \textbf{b}^{(1)} \in R^{m_{1} \times 1} \\ \textbf{W}^{(2)}&\in R^{m_{2} \times m_{1}} \quad \textbf{b}^{(2)} \in R^{m_{2} \times 1} \\&\qquad \qquad \cdots \\ \textbf{W}^{(K)}&\in R^{m_{K} \times m_{K-1}} \quad \textbf{b}^{(K)} \in R^{m_{K} \times 1} \end{aligned} \end{aligned}$$
(3)

where \((m_0,m_1,\cdots ,m_K)\) are the numbers of units in each layer. The activation functions of the layers are denoted as \((f^{(1)}, f^{(2)}, \cdots , f^{(K)})\), and thus the output of the k-th layer \(\textbf{Y}^{(k)}\) can be defined as

$$\begin{aligned} \begin{aligned} \boldsymbol{net}_{i}^{(k)}&=\sum _{j=1}^{m_{k-1}} W_{i, j}^{(k)} Y_{j}^{(k-1)}+b_{i}^{(k)},\quad \left( 1 \le i \le m_{k}\right) \\ \boldsymbol{net}^{(k)}&=\textbf{W}^{(k)} \textbf{Y}^{(k-1)}+\textbf{b}^{(k)} =\left[ \boldsymbol{net}_{1}^{(k)}, \boldsymbol{net}_{2}^{(k)}, \ldots , \boldsymbol{net}_{m_{k}}^{(k)}\right] ^{T} \\ \textbf{Y}^{(k)}&=f^{(k)}\left( \boldsymbol{net}^{(k)}\right) =\left[ Y_{1}^{(k)}, Y_{2}^{(k)}, \ldots , Y_{m_{k}}^{(k)}\right] ^{T} \end{aligned} \end{aligned}$$
(4)

We extract the URL u, the parameter names \(\{n_i|i\in N\}\), and the parameters \(\{p_i|i\in N\}\) from each API query and arrange them in the order they appear in the query as \((w_t)_{t\in \{1,2,\cdots ,M\}}\). We randomly remove a token \(w_t\) from \((w_t)_{t\in \{1,2,\cdots ,M\}}\) and feed the remaining tokens into the network through a look-up layer concatenated before the first layer. The look-up layer has parameters of dimension \(V \times E\), where V is the vocabulary size and E is the embedding size; it maps each token to continuous values. We expect the network to infer the removed token and output its probability in \(\textbf{Y}^{(K)}\). During training, the probability of the removed token in the output layer, i.e., the K-th layer, is maximized. After training, we use the values in the look-up layer as the embedding of a token, and the average of the embeddings of all tokens in an API query as the embedding of that API query. In this high-dimensional continuous space, the representations of tokens reflect the relationships between tokens, so subsequent modules can effectively exploit the semantic information they carry.
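As a concrete reference, the following PyTorch sketch reproduces this training objective under our own assumptions: the vocabulary size, embedding size, hidden size, number of layers, and optimizer settings are placeholders rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn

# Sketch of the semantic extraction module: a look-up (embedding) layer
# followed by fully connected layers that predict the removed token.
# V (vocabulary size), E (embedding size), and H (hidden size) are illustrative.
V, E, H = 5000, 64, 128

class SemanticExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.lookup = nn.Embedding(V, E)          # look-up layer of size V x E
        self.net = nn.Sequential(
            nn.Linear(E, H), nn.ReLU(),
            nn.Linear(H, V),                      # scores over the vocabulary
        )

    def forward(self, context):                   # context: (batch, M-1) token ids
        emb = self.lookup(context).mean(dim=1)    # average the remaining tokens
        return self.net(emb)

model = SemanticExtractor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy batch: token ids of queries with one token removed, plus the removed token.
context = torch.randint(0, V, (32, 7))
removed = torch.randint(0, V, (32,))

opt.zero_grad()
logits = model(context)
loss = loss_fn(logits, removed)     # maximize the probability of the removed token
loss.backward()
opt.step()

# After training, model.lookup.weight holds the token embeddings; the API
# embedding is the mean of the embeddings of the tokens in a query.
```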

3.3 Detection on Parameter Tampering

To reduce the model's reliance on large amounts of data and enable it to learn effectively when positive and negative samples are imbalanced, we additionally classify the API embeddings using a decision tree model.

A decision tree is a tree-structured model that describes how instances are classified; it is composed of nodes and directed edges. Nodes are of two types: internal nodes, which denote a feature or attribute, and leaf nodes, which denote a class. Classification begins at the root node, where a specific feature of the instance is tested; the instance is then assigned to one of the node's children according to the test result, each child corresponding to one value of the feature. Instances are tested and assigned recursively in this manner until a leaf node is reached. Specifically, suppose the training data consisting of all API embeddings is D, A is a feature, and \(C_k\) is the set of samples of class k; the dataset can be partitioned by A into \(D_1,D_2,\cdots , D_n\). Denoting the samples in \(D_i\) that belong to class \(C_k\) as \(D_{ik}\), the entropy of the dataset D can be calculated as

$$\begin{aligned} H(D)=-\sum _{k=1}^{K} \frac{\left| C_{k}\right| }{|D|} \log _{2} \frac{\left| C_{k}\right| }{|D|} \end{aligned}$$
(5)

and the conditional entropy of D given A is

$$\begin{aligned} H(D \mid A)=\sum _{i=1}^{n} \frac{\left| D_{i}\right| }{|D|} H\left( D_{i}\right) =-\sum _{i=1}^{n} \frac{\left| D_{i}\right| }{|D|} \sum _{k=1}^{K} \frac{\left| D_{i k}\right| }{\left| D_{i}\right| } \log _{2} \frac{\left| D_{i k}\right| }{\left| D_{i}\right| } \end{aligned}$$
(6)

and the information gain of D from A is defined as

$$\begin{aligned} g(D, A)=H(D)-H(D \mid A) \end{aligned}$$
(7)

During training, the attribute with the largest information gain is selected as the test attribute at each step, and the decision tree is constructed from top to bottom. The Parameter Tampering Detection Module in CMPD consists of this learned decision tree.
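The sketch below illustrates Eqs. (5)–(7) and the resulting detection module; it uses scikit-learn's decision tree with the entropy criterion, and synthetic vectors stand in for the learned API embeddings, so the data and dimensions are illustrative only.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def entropy(labels):
    """H(D) of Eq. (5)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, splits):
    """g(D, A) of Eq. (7): entropy minus the conditional entropy of Eq. (6)."""
    n = len(labels)
    cond = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(labels) - cond

# Synthetic stand-ins: 64-dimensional "API embeddings" with benign (0) / malicious (1) labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
y = rng.integers(0, 2, size=1000)

# Toy illustration of Eq. (7) on a single binary split of the labels.
left, right = y[X[:, 0] < 0], y[X[:, 0] >= 0]
print(information_gain(y, [left, right]))

# The detection module: a decision tree grown top-down with the entropy criterion.
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(clf.predict(X[:5]))
```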

3.4 Context-Based Malicious Parameter Detection Framework

Based on the previous analysis, we now illustrate the general architecture of the proposed Context-based Malicious Parameter Detection (CMPD) framework, which is shown in Fig. 2.

Fig. 2. General architecture of CMPD.

The CMPD framework consists of the semantic extraction and learning module detailed in Sect. 3.2 and the parameter tampering detection module detailed in Sect. 3.3. We first collect all API access records in the form of URL requests and feed the collected data into the semantic extraction and learning module. Each API request is thus mapped into the high-dimensional hidden space as an API vector representation, which contains context information about normal and abnormal parameters. We then collect all API representations and feed them to the parameter tampering detection module, which completes malicious parameter detection without relying on balanced or abundant data. The detection module classifies each API representation as a benign request or an abnormal request by traversing the decision tree from the root to a leaf node, as illustrated in Fig. 2.
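The sketch below shows how the two modules are chained at inference time, assuming a trained look-up table and a fitted decision tree; here both are replaced by dummies (token_embedding and DummyTree are our hypothetical stand-ins) so the example runs, and the request string is illustrative.

```python
from urllib.parse import urlsplit, parse_qsl
import numpy as np

# Stand-ins for the two trained modules (Sects. 3.2 and 3.3): in CMPD,
# token_embedding would read from the trained look-up layer and tree would
# be the fitted decision tree; here they are dummies so the sketch runs.
EMB_DIM = 64
rng = np.random.default_rng(0)
_cache = {}

def token_embedding(tok):
    """Placeholder look-up: one fixed random vector per token."""
    if tok not in _cache:
        _cache[tok] = rng.normal(size=EMB_DIM)
    return _cache[tok]

class DummyTree:
    def predict(self, X):
        return np.zeros(len(X), dtype=int)       # 0 = benign, 1 = malicious

tree = DummyTree()

def embed_request(request: str) -> np.ndarray:
    """Tokenize the URL, parameter names, and parameters, then average."""
    parts = urlsplit(request)
    tokens = [parts.path]
    for name, value in parse_qsl(parts.query):
        tokens.extend([name, value])
    return np.mean([token_embedding(t) for t in tokens], axis=0)

request = ("http://localhost:8080/tienda1/publico/entrar.jsp"
           "?errorMsg=Credenciales+incorrectas")
vec = embed_request(request)
label = tree.predict(vec.reshape(1, -1))[0]
print("malicious" if label else "benign")
```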

4 Experiments

4.1 Metrics

The experiments focus on identifying parameter tampering attacks, and the evaluation metrics used in this work are precision, recall, and \(F_1\). These metrics are calculated from the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) in the classification results. TP and TN are the numbers of correctly classified malicious and legitimate API requests, respectively. FP is the number of normal API requests misclassified as malicious, while FN is the number of malicious requests misclassified as legitimate. The precision is calculated as

$$\begin{aligned} \text{ Precision } =\frac{T P}{T P+F P} \end{aligned}$$
(8)

the recall is calculated as

$$\begin{aligned} \text{ Recall } =\frac{T P}{T P+F N} \end{aligned}$$
(9)

the \(F_1\) value is calculated as

$$\begin{aligned} F_1=\frac{2 \cdot \text{ Precision } \cdot \text {Recall}}{\text{ Precision } + \text {Recall}} \end{aligned}$$
(10)
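As a quick check of Eqs. (8)–(10), these metrics can be computed directly with scikit-learn; the labels and predictions below are toy values (1 denotes a malicious request).

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative labels: 1 = malicious (positive class), 0 = benign.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.75
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75
print("F1:       ", f1_score(y_true, y_pred))         # Eq. (10)       = 0.75
```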

4.2 Main Result

The results of the different methods on the HTTP DATASET CSIC 2010 are shown in Table 1. CRS stands for Core Rule Set, and PL stands for Paranoia Level, which controls the strictness of ModSecurity's rule checking, with a smaller value indicating greater strictness. As the PL increases, the precision of ModSecurity increases while its recall decreases, resulting in a decrease in the \(F_1\) value and indicating that the traditional rule-based method has a very limited effect. The SVM algorithm detects parameter tampering even less effectively than the traditional method, most likely because SVM depends heavily on the quality of the features, and the features fail to capture the data distribution. The Autoencoder is a deep learning-based method and performs better than the traditional methods. The proposed CMPD outperforms all baselines, including traditional detection methods and learning-based methods, in terms of \(F_1\). CMPD also has balanced precision and recall, indicating that our method has low false-positive and false-negative rates.

Table 1. Comparison of Precision, Recall, and \(F_1\) score.

4.3 Further Analysis

Influence of the Amount of Training Data. In practice, the numbers of normal and abnormal accesses may be extremely imbalanced. To explore the behavior of our model in more demanding settings, we conduct experiments in two ways. We first reduce the amount of training data to illustrate the performance of CMPD when training data is scarce. The influence of the amount of training data is shown in Fig. 3. We find that the classification performance of the model gradually increases as the training data grows, and the classification results are consistent with those in Table 1. When the complete training set is used, CMPD achieves the best classification performance. Moreover, when the training data is only 20% of the original dataset, the \(F_1\) value still reaches 0.89, indicating that CMPD is not sensitive to the amount of training data and remains effective even with little data.

Fig. 3. Influence of the percentage of training examples.

Influence of the Ratio Between Negative and Positive Examples. Further, we randomly drop malicious queries from the dataset; the performance of our method is shown in Fig. 4. We find that as the number of malicious samples decreases, the classification performance decreases accordingly, but even when the malicious samples are reduced to 1% of the normal samples, the \(F_1\) value still reaches above 0.91, indicating that our model remains effective even when normal and malicious samples are extremely imbalanced.

Fig. 4. Influence of the ratio of negative examples to positive examples.

Visualization of Model Concentration. We further collect the parameters and parameter names that the model identifies as having a critical impact on parameter tampering detection; a visualization of these tokens is shown in Fig. 5. Parameters such as email, login, and password, which are highly relevant to parameter tampering attacks, are correctly extracted and assigned high importance, indicating that the model pays different levels of attention to different parameters and successfully learns the features related to parameter tampering.

Fig. 5. Visualization of the parameters and parameter names that the model concentrates on most.

Table 2. Case study on different types of tampering

Case Study. To show that our method can identify tampering with parameters, tampering with parameter names, and mismatches between a URL and its parameters or parameter names, we present case-study results in Table 2. As illustrated in Table 2, if we tamper with a parameter of the API “http://localhost:8080/tienda1/publico/entrar.jsp” by appending the string “%11” to the normal parameter “errorMsg=Credenciales+incorrectas”, the CMPD framework detects that the API's parameters have been tampered with. Similarly, if we tamper with the parameter name (from errorMsg to errorMsgBAC), CMPD still detects the tampering, showing that our method learns the correct relationship between parameters and parameter names. Furthermore, if we replace the parameter of “http://localhost:8080/tienda1/publico/entrar.jsp” with the normal parameter of another API, “http://localhost:8080/tienda1/publico/vaciar.jsp”, our method also detects that the parameter and parameter name do not correspond to the correct URL. This shows that CMPD successfully learns the correspondence among URLs, parameters, and parameter names, so we do not need separate models to detect parameter tampering attacks for different APIs.

5 Conclusion

APIs are vital for data exchange between programs, but their widespread use has introduced significant security risks. By modifying API parameters, an adversary can launch Web attacks such as SQL Injection and Cross-Site Scripting (XSS), so detecting API parameter tampering is critical to keep systems running smoothly. Previous works mostly rely on rule-based or simple learning-based methods, which ignore the contextual information of API tokens and thus perform poorly. In this paper, we propose the Context-based Malicious Parameter Detection (CMPD) framework for detecting API parameter tampering attacks. We first learn the distribution of parameters, parameter names, and URLs using a neural network language model and then use a tree model to detect malicious queries based on the high-dimensional API embeddings. On the CSIC 2010 dataset, CMPD outperforms all baselines with an \(F_1\) of 0.971.