
1 Introduction

In recent years, the traffic dividend of e-commerce platforms has peaked. To win customers and improve user stickiness, every platform runs marketing activities. This process has also given rise to the econnoisseur, who exploits vulnerabilities in platform activities for profit.

According to the Analysis Report on the Application of Digital Financial Anti-Fraud Technology (2021), released by the Institute of Cloud Computing and Big Data of the China Academy of Information and Communications Technology together with the ICBC Security Attack and Defense Laboratory, the total loss caused by black-industry fraud (see Fig. 1) has been increasing year by year and is expected to reach 710 billion yuan in 2022, with econnoisseurs accounting for a large proportion. In 2019, Pinduoduo lost tens of millions of yuan to econnoisseurs within a few hours because of an expired-coupon bug on the platform. In 2021, a coupon misconfiguration on Jingdong Mall was discovered and spread by econnoisseurs, resulting in a direct loss of nearly 70 million yuan. The existence of the econnoisseur not only damages the interests of ordinary users but also greatly undermines the effectiveness of the platforms' marketing activities, so its governance is urgent.

Fig. 1. Fraud losses and forecasts as a percentage of GDP

2 Related Work

The application of anomaly detection in the field of cyber security focuses on APT detection, intrusion detection and so on.

An APT attack is a hidden and persistent network intrusion that carries out advanced persistent threats against specific targets. Bohara [1] compared several unsupervised algorithms such as K-means for APT detection and found that they can identify infected hosts. Zhong Yao [2] performed anomaly detection on traffic log data with Isolation Forest and found that it has a certain ability to detect APT attacks and mark suspected infected hosts.

Intrusion detection systems detect intruders in a network. The detection methods can be divided into supervised and unsupervised machine learning approaches. References [3,4,5,6] mainly use supervised algorithms such as Naive Bayes, Bayesian networks, hidden Markov models, and ensemble learning for intrusion detection. These methods require a large amount of labeled sample data, which is unavailable in many scenarios. References [7,8,9,10,11] use unsupervised algorithms such as K-means clustering, hierarchical clustering, and DBSCAN, which show good detection ability.

At present, research on econnoisseur detection is still at the theoretical stage; in engineering practice, detection mainly relies on traditional threshold setting or rule-based interception. Yuan Dandan [12] used a community discovery algorithm to group econnoisseurs with similar characteristics, but the method fails when those characteristics change.

The challenges of e-commerce econnoisseur identification are as follows: 1. Rule omission: most platforms intercept econnoisseurs based on rules, so detections are missed when econnoisseur behavior changes. 2. Insufficient sample labels: new econnoisseurs appear every day, and manual labeling incurs a large labor cost. 3. Model detection lag: as the econnoisseur's methods are iterated, existing rules and models lag behind, so the model must be periodically retrained and maintained.

3 Theoretical Basis

Anomaly detection can be divided into supervised and unsupervised approaches. Since econnoisseurs appear in real time and the sample data in practical applications is mostly unlabeled, this paper adopts an unsupervised anomaly detection scheme. Considering that detection methods suit different scenarios and data, this paper compares three commonly used anomaly detection schemes on e-commerce website log data to assess their ability to identify econnoisseurs.

3.1 Isolation Forest (IForest)

The Isolation Forest model was first proposed by Fei Tony Liu of Monash University and Zhou Zhihua's team [13] as an ensemble learning method. It has been used in industrial anomaly detection owing to its high accuracy and linear time complexity. The model rests on two assumptions: 1. abnormal data differ from normal data; 2. the proportion of abnormal data is small. Both assumptions hold for econnoisseur detection. The Isolation Forest algorithm cuts the data space with random hyperplanes; each cut divides the data into two subspaces, and cutting continues until each subspace contains only one sample point or the given tree height is reached.

Isolation Trees are trained first to obtain the Isolation Forest. Then the anomaly score s of each test sample is calculated and compared with a given threshold to determine whether the sample is abnormal.
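As an illustration of the tree construction just described, the following Python sketch grows a single Isolation Tree by recursive random axis-aligned cuts. The sub-sample size of 256 and the random 9-dimensional data are assumptions for demonstration, not settings taken from this paper.

```python
import numpy as np

def grow_itree(X, height, max_height, rng):
    """Grow one Isolation Tree by random axis-aligned cuts (minimal sketch)."""
    if height >= max_height or len(X) <= 1:
        return {"size": len(X)}                        # external node
    q = rng.integers(X.shape[1])                       # random feature
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:                                       # feature is constant, cannot split
        return {"size": len(X)}
    p = rng.uniform(lo, hi)                            # random split value
    return {"q": q, "p": p,
            "left":  grow_itree(X[X[:, q] < p],  height + 1, max_height, rng),
            "right": grow_itree(X[X[:, q] >= p], height + 1, max_height, rng)}

rng = np.random.default_rng(0)
sample = rng.random((256, 9))                          # hypothetical 9-d user feature sub-sample
tree = grow_itree(sample, 0, max_height=int(np.ceil(np.log2(256))), rng=rng)
```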


After training, the anomaly score of a test sample can be evaluated from the generated Isolation Trees. Since the structure of an Isolation Tree is equivalent to that of a binary search tree (BST), the average path length of an unsuccessful BST search can be used to estimate the average path length of the Isolation Tree.

$$ c(n) = 2H(n - 1) - \frac{2(n - 1)}{n} $$
(1)
$$ H(n) = \ln (n) + 0.5772156649 $$
(2)

Formula (1) gives the average path length of an Isolation Tree built from n samples; it is used to normalize the path depth of a sample on the Isolation Tree, so the anomaly score of test sample x is given by Formula (3).

$$ s(x,n) = 2^{ - \frac{E(h(x))}{{c(n)}}} $$
(3)
$$ E(h(x)) = \sum_{i = 1}^t {{{h_i (x)} / t}} $$
(4)
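A minimal numerical sketch of Formulas (1)-(4): given the path lengths h_i(x) returned by t trees, the anomaly score s(x, n) is computed as below. The path-length values and the sub-sample size n = 256 are made-up inputs used only for illustration.

```python
import numpy as np

EULER_GAMMA = 0.5772156649                     # constant in H(n), Formula (2)

def c(n):
    """Average path length of an unsuccessful BST search over n samples, Formulas (1)-(2)."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """s(x, n), Formulas (3)-(4): path_lengths holds h_i(x) from each of the t trees."""
    e_h = np.mean(path_lengths)                # E(h(x)), Formula (4)
    return 2.0 ** (-e_h / c(n))

print(anomaly_score([2.0, 3.0, 2.5], n=256))   # short paths -> score ~0.84, likely abnormal
print(anomaly_score([10.0, 10.5, 9.5], n=256)) # paths near c(256) -> score ~0.5, normal
```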

3.2 Local Outlier Factor (LOF)

The Local Outlier Factor [14] is a model that determines whether sample points are abnormal based on density. Its core idea is that the density around an abnormal point is lower than that around normal points. Before introducing the Local Outlier Factor model, we need some basic concepts.

Definition 1:

(k-distance). For point p, sort the distances from p to all other points in ascending order; the k-th smallest distance is the k-distance of p. If point o is the k-th nearest point to p, then the k-distance of p is

$$ k\_distance\left( p \right)\, = \,d\left( {p,o} \right) $$
(5)

Definition 2:

(k-distance neighborhood). Draw a circle centered at point p with radius equal to its k-distance. The points inside this circle form the k-distance neighborhood of p, i.e.

$$ N_k (p) = \{ o^{\prime} \mid d(p,o^{\prime}) \le d_k (p) \} $$
(6)

Definition 3:

(reachability distance). The reachability distance from point p to point o is the larger of the k-distance of o and the direct distance between o and p, i.e.

$$ reach\_dist_k (o,p) = max\{ d_k (o),d(o,p)\} $$
(7)

Definition 4:

(local reachability density). The local reachability density of point p is the reciprocal of the average reachability distance from p to the points in its neighborhood, defined as

$$ lrd_k (p) = \frac{1}{\frac{\sum_{o \in N_k (p)} reach\_dist_k (p,o)}{\left| N_k (p) \right|}}. $$
(8)

Definition 5:

(local outlier factor). The local outlier factor of point p is the mean of the local reachability densities of the points in its neighborhood divided by the local reachability density of p, defined as

$$ LOF_k (p) = \frac{\sum_{o \in N_k (p)} \frac{lrd_k (o)}{lrd_k (p)}}{\left| N_k (p) \right|}. $$
(9)
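The following Python sketch implements Definitions 1-5 (Formulas (5)-(9)) directly from pairwise distances; it is only suitable for small data sets, and the random 9-dimensional feature matrix and the threshold of 1.5 are illustrative assumptions rather than values used in this paper.

```python
import numpy as np

def lof_scores(X, k):
    """Direct computation of Formulas (5)-(9) for a small data set (minimal sketch)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)    # pairwise distances
    neighbors = np.argsort(D, axis=1)[:, 1:k + 1]                 # k nearest neighbours, self excluded
    k_dist = D[np.arange(n), neighbors[:, -1]]                    # k-distance of each point, Formula (5)

    def lrd(p):
        # reach_dist_k(o, p) = max{d_k(o), d(o, p)}, Formula (7); average and invert, Formula (8)
        reach = np.maximum(k_dist[neighbors[p]], D[p, neighbors[p]])
        return 1.0 / reach.mean()

    lrds = np.array([lrd(p) for p in range(n)])
    return np.array([lrds[neighbors[p]].mean() / lrds[p] for p in range(n)])  # Formula (9)

X = np.random.rand(200, 9)                        # hypothetical 9-d user feature matrix
scores = lof_scores(X, k=20)
print(int((scores > 1.5).sum()), "points above the illustrative LOF threshold of 1.5")
```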

Algorithm steps:

  1. Calculate the distance between each sample point and all other points and sort them from near to far.

  2. For each sample point, find the points in its k-distance neighborhood and calculate its LOF score.

  3. Given a threshold, if the LOF value of a sample point exceeds the threshold, the point is abnormal.

Thus, the algorithm estimates each sample's density from the points in its k-distance neighborhood; the lower the density, the more likely the sample is abnormal.
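In practice the same computation is available off the shelf. A minimal scikit-learn sketch follows, where the feature matrix, n_neighbors, and contamination values are assumptions, not the parameters used in this paper.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.rand(1000, 9)                       # hypothetical 9-d user feature matrix

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)                       # -1 marks low-density (suspected) samples
scores = -lof.negative_outlier_factor_            # larger value -> more anomalous
print(int((labels == -1).sum()), "users flagged as outliers")
```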

3.3 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [15] is a density-based spatial clustering algorithm.

The concepts involved in the algorithm are:

Definition 1:

(Core user). For sample point p and a given distance ε, if there are at least Minpts sample points within the ε neighborhood of p, then p is a core point. The density of p is defined as \(\rho (p) = \left| {N_\varepsilon (p)} \right|\), where \(N_\varepsilon \left( \cdot \right)\) denotes the set of points in the ε neighborhood. If p is a core user, then

$$ \rho (p) = \left| {N_\varepsilon (p)} \right| \ge Minpts $$
(10)

Definition 2:

(Directly density-reachable). Point p is directly density-reachable from point q if the following two conditions are satisfied:

i) p is in the ε neighborhood of the core point q, i.e. \(p \in N_\varepsilon (q)\); ii) q is a core user, i.e. \(\left| {N_\varepsilon (q)} \right| \ge Minpts\).

Definition 3:

(Density-reachable). If there is a point o from which both p and q are directly density-reachable, then points p and q are density-reachable.

Definition 4:

(Border user). For sample point p, if the number of sample points within its ε radius is less than Minpts but p lies in the neighborhood of some core point, then p is a border user.

Definition 5:

(Noise user). If sample point p is neither a core user nor a border user, it is called a noise user.

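A minimal scikit-learn sketch of applying DBSCAN to the user features, where the label -1 corresponds to the noise users defined above; eps, min_samples, and the random feature matrix are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(1000, 9)                       # hypothetical 9-d user feature matrix

db = DBSCAN(eps=0.5, min_samples=10).fit(X)       # eps = ε, min_samples = Minpts
noise_users = np.flatnonzero(db.labels_ == -1)    # noise users (suspected econnoisseurs)
print(len(noise_users), "noise users detected")
```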

4 Research Process

4.1 Data Sources

The data comes from the real-time streaming log data in the Knownsec Security Intelligence Brain. The fields in the log data include access time, access IP, user agent, URL link, website domain name, etc. The log data of three e-commerce websites was screened, and each (IP, user agent) pair was regarded as an independent visitor to the website.

Log data with abnormal access URLs or very few visits is deleted to reduce interference with the model. The data of the three e-commerce websites in November 2021 is sampled to observe their behavior over one month. The website visits are shown in Table 1:

Table 1. Website visit number.

4.2 Feature Construction

After analyzing the access data of some users, we found that econnoisseurs fall into two categories: 1. Monitoring users, who monitor promotional information on commodities in real time at low frequency. 2. Activity-type users, who apply for and purchase heavily discounted goods at high frequency during promotional activities.

Combining these characteristics of the econnoisseur, we constructed three groups of user features: website visit count, website visit time, and distinct pages visited, giving nine features in total, shown in Table 2.

Table 2. User features
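As a rough illustration of how visitor-level features can be aggregated from the log fields listed in Sect. 4.1, the sketch below groups records by (IP, user agent) and computes a few example features; the column names and the concrete feature formulas are assumptions, since the actual nine features are defined in Table 2.

```python
import pandas as pd

# Hypothetical raw access-log frame; field names follow Sect. 4.1 but are assumptions.
logs = pd.DataFrame({
    "ip":          ["1.2.3.4", "1.2.3.4", "1.2.3.4", "5.6.7.8"],
    "user_agent":  ["UA-a",    "UA-a",    "UA-a",    "UA-b"],
    "url":         ["/item/01", "/item/02", "/item/01", "/coupon"],
    "access_time": pd.to_datetime(["2021-11-01 10:00", "2021-11-01 10:05",
                                   "2021-11-01 10:30", "2021-11-01 11:00"]),
})

grouped = logs.groupby(["ip", "user_agent"])
features = pd.DataFrame({
    "visit_count":    grouped.size(),                     # website visit count
    "distinct_urls":  grouped["url"].nunique(),           # different pages visited
    "visit_duration": grouped["access_time"].agg(         # website visit time (seconds)
        lambda t: (t.max() - t.min()).total_seconds()),
})
print(features)
```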

4.3 User Behavior

The user feature data of website C is reduced to two dimensions for visualization with t-SNE (t-distributed Stochastic Neighbor Embedding), as shown in Fig. 2. The user data falls into four categories according to the color of the points. Samples were drawn from the four categories and analyzed: econnoisseurs appear in categories two and four, while normal users are concentrated in categories one and three (Fig. 3).
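A minimal sketch of the dimensionality reduction used for Fig. 2, based on scikit-learn's TSNE; the random feature matrix and the perplexity value are assumptions made for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(2000, 9)                      # hypothetical user feature matrix for website C
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)                                # (2000, 2) -> scatter-plot coordinates for Fig. 2
```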

Fig. 2. User dimension reduction visualization

Fig. 3. Feature density curves of different categories

Feature density curves are drawn for the four categories of sample users, as shown in Fig. 3. It can be seen that the features of the different categories of users differ:

  • Category 1: random visit users. Characterized by few visits, short visit time, and little page variety.

  • Category 2: monitoring users. They visit specific web pages over a long time span but infrequently.

  • Category 3: normal access users. The categories of web pages visited are scattered, and the visit time is longer than that of category 2; these are normal users browsing web information.

  • Category 4: specific page access users. They visit the website many times in a short period, and the pages visited are targeted.

4.4 Result Analysis

The data of the previous week is used as the training set, and econnoisseur detection is carried out on the latest day's user accesses. Updating and retraining on the training data daily captures the recent overall distribution of users, so the model can be adjusted in near real time. During the Double Eleven period, these e-commerce companies ran promotional activities, which also became a carnival for the econnoisseur. The user access data of November 11 and November 18 were extracted to compare the detection effect during the active period and the non-active period.
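A minimal sketch of the rolling training scheme described above: fit an Isolation Forest on the previous week's user features and score the latest day. The contamination value and the feature matrices are assumptions rather than the paper's actual settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X_train = np.random.rand(20000, 9)     # features of users seen in the previous 7 days
X_today = np.random.rand(3000, 9)      # features of users seen on the latest day

model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
model.fit(X_train)                     # retrained daily on the rolling 7-day window
flags = model.predict(X_today)         # -1 marks suspected econnoisseurs
print(int((flags == -1).sum()), "suspected econnoisseurs today")
```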

Since the detected user data is unlabeled and the sample size is large, 3000 users were randomly sampled from each website and labeled to evaluate the models. As shown in Table 3, the precision and recall of the Isolation Forest model are higher than those of the other two models, indicating better applicability.

Further analysis showed that some econnoisseurs were missed because the set of website URLs they visited differs only slightly from that of normal users. For example, aa/01 and aa/02 are URLs of the same type and can be merged to better distinguish user access patterns. We therefore merge URLs of the same type based on a Bidirectional Long Short-Term Memory (BiLSTM) model to reduce their impact on the URL visit distribution. The resulting detection performance is shown in Table 4.
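The paper does not give the details of the BiLSTM-based URL merging, so the following PyTorch sketch is only one possible interpretation: a character-level BiLSTM embeds each URL, and URLs whose embeddings are close are treated as the same type before visit counts are fed to the Isolation Forest. The architecture, the similarity threshold, and the fact that the encoder would still need to be trained are all assumptions.

```python
import torch
import torch.nn as nn

class URLEncoder(nn.Module):
    """Character-level BiLSTM that maps a URL string to a fixed-size embedding."""
    def __init__(self, vocab_size=128, emb_dim=16, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                 # char_ids: (batch, seq_len)
        h, _ = self.lstm(self.emb(char_ids))     # (batch, seq_len, 2 * hidden)
        return h.mean(dim=1)                     # mean-pool over characters

def encode(urls, model):
    ids = [torch.tensor([min(ord(c), 127) for c in u]) for u in urls]
    batch = nn.utils.rnn.pad_sequence(ids, batch_first=True)
    with torch.no_grad():
        return model(batch)

urls = ["aa/01", "aa/02", "bb/detail"]
emb = encode(urls, URLEncoder())
# URLs whose embeddings are close (e.g. cosine similarity above a threshold)
# would be merged into one URL type before counting visits for the IForest features.
sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(float(sim))
```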

The average number of econnoisseurs detected per day from November 1 to November 11 is taken as the daily average during the activity period, and that from November 12 to November 30 as the daily average during the inactive period. As shown in Fig. 4, the number of econnoisseurs during the activity period is about twice that of the non-activity period.

It can be concluded from the above:

  1. Across the three e-commerce websites and across both active and inactive periods, the Isolation Forest model achieves higher precision and recall than the other two models, giving good results in econnoisseur detection.

  2. After merging URLs of the same type with the BiLSTM model, the recall of the BiLSTM-IForest model for econnoisseurs is significantly higher than that of the plain Isolation Forest.

Detection performs better during the non-activity period than during the activity period. Analysis shows that some users visited discounted products many times but at low frequency; this behavior resembles that of normal users and was not detected. This can be further optimized in later work.

Table 3. Model result comparison
Table 4. Comparison of results before and after the improvement of the Isolation Forest
Fig. 4. Detection amount of econnoisseur users in different periods

5 Conclusion

The rise of e-commerce platforms not only brings convenience to people's lives but also poses greater challenges to the platforms' risk control. How to control and reduce the risk of the platform being exploited by econnoisseurs urgently needs to be addressed. Based on real log data of users visiting websites in the Knownsec Security Intelligence Brain, this paper extracts nine user features to identify the econnoisseur.

By comparison, the unsupervised anomaly detection models show a certain ability to detect e-commerce econnoisseurs. Among them, Isolation Forest achieves higher precision and recall than the LOF and DBSCAN models on the three e-commerce websites and is more suitable for the current e-commerce user data.

Further analysis found that URLs of the same type visited by users can be merged to better describe their real visit behavior. The detection results show that econnoisseur detection based on BiLSTM-IForest is further improved.

The econnoisseurs identified in this paper can have their traffic intercepted in advance. Subsequently, a risk control model can be built on users' actual browsing, purchasing, and other specific behaviors to strengthen real-time prevention and control.

The econnoisseur and the platform are locked in constant competition, each side escalating in response to the other. We need to track the econnoisseur's attack methods in real time and, combined with the risk control platform, further counter the econnoisseur.