
1 Introduction

In recent years, the traffic dividend of e-commerce platforms has peaked. To win customers and improve user stickiness, every platform runs marketing activities. This process has also given rise to the econnoisseur, who exploits vulnerabilities in platform activities for profit.

According to the Analysis Report on the Application of Digital Financial Anti-Fraud Technology (2021), released by the Institute of Cloud Computing and Big Data of the China Academy of Information and Communications Technology together with the ICBC Security Attack and Defense Laboratory, the total loss caused by black-industry fraud (see Fig. 1) has been increasing year by year and is expected to reach 710 billion yuan in 2022, with econnoisseurs accounting for a large proportion. In 2019, Pinduoduo lost tens of millions of yuan to econnoisseurs within a few hours because of an expired-coupon bug on the platform. In 2021, a coupon misconfiguration on Jingdong Mall was discovered and spread by econnoisseurs, resulting in a direct loss of nearly 70 million yuan. The existence of the econnoisseur not only damages the interests of ordinary users but also greatly undermines the effectiveness of the platforms' marketing activities, so its governance is urgent.

Fig. 1. Fraud losses and forecasts as a percentage of GDP

2 Related Work

The application of anomaly detection in the field of cyber security focuses on APT detection, intrusion detection and so on.

An APT attack is a hidden and persistent network intrusion that carries out advanced persistent threats against specific targets. Bohara [1] compared several unsupervised algorithms such as K-means for APT detection and found that they can identify infected hosts. Zhong Yao [2] performed anomaly detection on traffic log data with Isolation Forest and found that it has a certain ability to detect APT attacks and mark suspected infected hosts.

Intrusion detection systems detect intruders in a network. The detection methods can be divided into supervised and unsupervised machine learning approaches. References [3,4,5,6] mainly use supervised algorithms such as Naive Bayes, Bayesian networks, hidden Markov models, and ensemble learning for intrusion detection. These methods require a large amount of labeled sample data, which is unavailable in many scenarios. References [7,8,9,10,11] use unsupervised algorithms such as K-means clustering, hierarchical clustering, and DBSCAN, which show good detection ability.

At present, research on econnoisseur detection is still at the theoretical stage; in engineering practice, detection mainly relies on traditional threshold setting or rule-based interception. Yuan Dandan [12] used a community discovery algorithm to group econnoisseurs with similar characteristics, but the method fails when those characteristics change.

The challenges of e-commerce econnoisseur identification are as follows: 1. Rule omission: most platforms intercept econnoisseurs based on rules, so detections are missed when econnoisseur behavior changes. 2. Insufficient sample labels: new econnoisseurs appear every day, and manual labeling incurs a large labor cost. 3. Model detection lag: as the econnoisseur's methods are iterated, existing rules and models lag behind, so the model must be periodically retrained and maintained.

3 Theoretical Basis

Anomaly detection can be divided into supervised and unsupervised approaches. Since econnoisseurs appear in real time and the sample data in practical applications is mostly unlabeled, this paper adopts an unsupervised anomaly detection scheme. Considering that detection methods suit different scenarios and data, this paper compares three commonly used anomaly detection schemes on e-commerce website log data to assess their ability to identify econnoisseurs.

3.1 Isolation Forest (IForest)

The Isolation Forest model was first proposed by Fei Tony Liu of Monash University and Zhou Zhihua's team [13] as an ensemble learning method. It has been used in industrial anomaly detection owing to its high accuracy and linear time complexity. The model rests on two assumptions: 1. abnormal data differ from normal data; 2. the proportion of abnormal data is small. Both assumptions hold for econnoisseur detection. The Isolation Forest algorithm cuts the data space with random hyperplanes; each cut divides the data into two subspaces, and cutting continues until each subspace contains only one sample point or the given tree height is reached.

Isolation Trees are trained first to obtain the Isolation Forest. Then the anomaly score s of each test sample is calculated and compared with a given threshold to determine whether the sample is abnormal.
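As an illustration of the tree construction just described, the following Python sketch grows a single Isolation Tree by recursive random axis-aligned cuts. The sub-sample size of 256 and the random 9-dimensional data are assumptions for demonstration, not settings taken from this paper.

```python
import numpy as np

def grow_itree(X, height, max_height, rng):
    """Grow one Isolation Tree by random axis-aligned cuts (minimal sketch)."""
    if height >= max_height or len(X) <= 1:
        return {"size": len(X)}                        # external node
    q = rng.integers(X.shape[1])                       # random feature
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:                                       # feature is constant, cannot split
        return {"size": len(X)}
    p = rng.uniform(lo, hi)                            # random split value
    return {"q": q, "p": p,
            "left":  grow_itree(X[X[:, q] < p],  height + 1, max_height, rng),
            "right": grow_itree(X[X[:, q] >= p], height + 1, max_height, rng)}

rng = np.random.default_rng(0)
sample = rng.random((256, 9))                          # hypothetical 9-d user feature sub-sample
tree = grow_itree(sample, 0, max_height=int(np.ceil(np.log2(256))), rng=rng)
```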


After training, the anomaly score of a test sample can be evaluated from the generated Isolation Trees. Since the structure of an Isolation Tree is equivalent to that of a binary search tree (BST), the average path length of an unsuccessful BST search can be used to estimate the average path length of the Isolation Tree.

$$ c(n) = 2H(n - 1) - \frac{2(n - 1)}{n} $$
(1)
$$ H(n) = \ln (n) + 0.5772156649 $$
(2)

Formula (1) gives the average path length of an Isolation Tree built from n samples; it is used to normalize the path depth of a sample on the Isolation Tree, so the anomaly score of test sample x is given by Formula (3).

$$ s(x,n) = 2^{ - \frac{E(h(x))}{{c(n)}}} $$
(3)
$$ E(h(x)) = \sum_{i = 1}^t {{{h_i (x)} / t}} $$
(4)
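A minimal numerical sketch of Formulas (1)-(4): given the path lengths h_i(x) returned by t trees, the anomaly score s(x, n) is computed as below. The path-length values and the sub-sample size n = 256 are made-up inputs used only for illustration.

```python
import numpy as np

EULER_GAMMA = 0.5772156649                     # constant in H(n), Formula (2)

def c(n):
    """Average path length of an unsuccessful BST search over n samples, Formulas (1)-(2)."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """s(x, n), Formulas (3)-(4): path_lengths holds h_i(x) from each of the t trees."""
    e_h = np.mean(path_lengths)                # E(h(x)), Formula (4)
    return 2.0 ** (-e_h / c(n))

print(anomaly_score([2.0, 3.0, 2.5], n=256))   # short paths -> score ~0.84, likely abnormal
print(anomaly_score([10.0, 10.5, 9.5], n=256)) # paths near c(256) -> score ~0.5, normal
```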

3.2 Local Outlier Factor (LOF)

The Local Outlier Factor [14] is a model that determines whether sample points are abnormal based on density. Its core idea is that the density around an abnormal point is lower than that around normal points. Before introducing the Local Outlier Factor model, we need some basic concepts.

Definition 1:

(k-distance). For point p, sort the distances from p to all other points in ascending order; the k-th smallest distance is the k-distance of p. If point o is the k-th nearest point to p, then the k-distance of p is

$$ k\_distance\left( p \right)\, = \,d\left( {p,o} \right) $$
(5)

Definition 2:

(k-distance neighborhood). Draw a circle centered at point p with radius equal to its k-distance. The points inside this circle form the k-distance neighborhood of p, i.e.

$$ N_k (p) = \{ o^{\prime} \mid d(p,o^{\prime}) \le d_k (p) \} $$
(6)

Definition 3:

(reachability distance). The reachability distance from point p to point o is the larger of the k-distance of o and the direct distance between o and p, i.e.

$$ reach\_dist_k (o,p) = max\{ d_k (o),d(o,p)\} $$
(7)

Definition 4:

(local reachability density). The local reachability density of point p is the reciprocal of the average reachability distance from p to the points in its neighborhood, defined as

$$ lrd_k (p) = \frac{1}{\frac{\sum_{o \in N_k (p)} reach\_dist_k (p,o)}{\left| N_k (p) \right|}}. $$
(8)

Definition 5:

(local outlier factor). The local outlier factor of point p is the mean of the local reachability densities of the points in its neighborhood divided by the local reachability density of p, defined as

$$ LOF_k (p) = \frac{\sum_{o \in N_k (p)} \frac{lrd_k (o)}{lrd_k (p)}}{\left| N_k (p) \right|}. $$
(9)
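The following Python sketch implements Definitions 1-5 (Formulas (5)-(9)) directly from pairwise distances; it is only suitable for small data sets, and the random 9-dimensional feature matrix and the threshold of 1.5 are illustrative assumptions rather than values used in this paper.

```python
import numpy as np

def lof_scores(X, k):
    """Direct computation of Formulas (5)-(9) for a small data set (minimal sketch)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)    # pairwise distances
    neighbors = np.argsort(D, axis=1)[:, 1:k + 1]                 # k nearest neighbours, self excluded
    k_dist = D[np.arange(n), neighbors[:, -1]]                    # k-distance of each point, Formula (5)

    def lrd(p):
        # reach_dist_k(o, p) = max{d_k(o), d(o, p)}, Formula (7); average and invert, Formula (8)
        reach = np.maximum(k_dist[neighbors[p]], D[p, neighbors[p]])
        return 1.0 / reach.mean()

    lrds = np.array([lrd(p) for p in range(n)])
    return np.array([lrds[neighbors[p]].mean() / lrds[p] for p in range(n)])  # Formula (9)

X = np.random.rand(200, 9)                        # hypothetical 9-d user feature matrix
scores = lof_scores(X, k=20)
print(int((scores > 1.5).sum()), "points above the illustrative LOF threshold of 1.5")
```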

Algorithm steps:

  1. Calculate the distance between each sample point and all other points and sort them from near to far.

  2. For each sample point, find the points in its k-distance neighborhood and calculate its LOF score.

  3. Given a threshold, if the LOF value of a sample point exceeds the threshold, the point is abnormal.

Thus, the algorithm estimates each sample's density from the points in its k-distance neighborhood; the lower the density, the more likely the sample is abnormal.
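In practice the same computation is available off the shelf. A minimal scikit-learn sketch follows, where the feature matrix, n_neighbors, and contamination values are assumptions, not the parameters used in this paper.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.rand(1000, 9)                       # hypothetical 9-d user feature matrix

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)                       # -1 marks low-density (suspected) samples
scores = -lof.negative_outlier_factor_            # larger value -> more anomalous
print(int((labels == -1).sum()), "users flagged as outliers")
```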

3.3 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [15] is a density-based spatial clustering algorithm.

The concepts involved in the algorithm are:

Definition 1:

(Core user). For sample point p and a given distance ε, if there are at least Minpts sample points within the ε neighborhood of p, then p is a core point. The density of p is defined as \(\rho (p) = \left| {N_\varepsilon (p)} \right|\), where \(N_\varepsilon \left( \cdot \right)\) denotes the set of points in the ε neighborhood. If p is a core user, then

$$ \rho (p) = \left| {N_\varepsilon (p)} \right| \ge Minpts $$
(10)

Definition 2:

(Directly density-reachable). Point p is directly density-reachable from point q if the following two conditions are satisfied:

i) p is in the ε neighborhood of the core point q, i.e. \(p \in N_\varepsilon (q)\); ii) q is a core user, i.e. \(\left| {N_\varepsilon (q)} \right| \ge Minpts\).

Definition 3:

(Density-reachable). If there is a point o from which both p and q are directly density-reachable, then points p and q are density-reachable.

Definition 4:

(Border user). For sample point p, if the number of sample points within its ε radius is less than Minpts but p lies in the neighborhood of some core point, then p is a border user.

Definition 5:

(Noise user). If sample point p is neither a core user nor a border user, it is called a noise user.

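A minimal scikit-learn sketch of applying DBSCAN to the user features, where the label -1 corresponds to the noise users defined above; eps, min_samples, and the random feature matrix are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(1000, 9)                       # hypothetical 9-d user feature matrix

db = DBSCAN(eps=0.5, min_samples=10).fit(X)       # eps = ε, min_samples = Minpts
noise_users = np.flatnonzero(db.labels_ == -1)    # noise users (suspected econnoisseurs)
print(len(noise_users), "noise users detected")
```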

4 Research Process

4.1 Data Sources

The data comes from the real-time streaming log data in the Knownsec Security Intelligence Brain. The fields in the log data include access time, access IP, user agent, URL link, website domain name, etc. The log data of three e-commerce websites was screened, and each (IP, user agent) pair was regarded as an independent visitor to the website.

Log data with abnormal access URLs or very few visits is deleted to reduce interference with the model. The data of the three e-commerce websites in November 2021 is sampled to observe their behavior over one month. The website visits are shown in Table 1:

Table 1. Website visit number.

4.2 Feature Construction

After analyzing the access data of some users, we found that econnoisseurs fall into two categories: 1. Monitoring users, who monitor promotional information on commodities in real time at low frequency. 2. Activity-type users, who apply for and purchase heavily discounted goods at high frequency during promotional activities.

Combining these characteristics of the econnoisseur, we constructed three groups of user features: website visit count, website visit time, and distinct pages visited, giving nine features in total, shown in Table 2.

Table 2. User features
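As a rough illustration of how visitor-level features can be aggregated from the log fields listed in Sect. 4.1, the sketch below groups records by (IP, user agent) and computes a few example features; the column names and the concrete feature formulas are assumptions, since the actual nine features are defined in Table 2.

```python
import pandas as pd

# Hypothetical raw access-log frame; field names follow Sect. 4.1 but are assumptions.
logs = pd.DataFrame({
    "ip":          ["1.2.3.4", "1.2.3.4", "1.2.3.4", "5.6.7.8"],
    "user_agent":  ["UA-a",    "UA-a",    "UA-a",    "UA-b"],
    "url":         ["/item/01", "/item/02", "/item/01", "/coupon"],
    "access_time": pd.to_datetime(["2021-11-01 10:00", "2021-11-01 10:05",
                                   "2021-11-01 10:30", "2021-11-01 11:00"]),
})

grouped = logs.groupby(["ip", "user_agent"])
features = pd.DataFrame({
    "visit_count":    grouped.size(),                     # website visit count
    "distinct_urls":  grouped["url"].nunique(),           # different pages visited
    "visit_duration": grouped["access_time"].agg(         # website visit time (seconds)
        lambda t: (t.max() - t.min()).total_seconds()),
})
print(features)
```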

4.3 User Behavior

The user feature data of website C is reduced to two dimensions for visualization with t-SNE (t-distributed Stochastic Neighbor Embedding), as shown in Fig. 2. The user data falls into four categories according to the color of the points. Samples were drawn from the four categories and analyzed: econnoisseurs appear in categories two and four, while normal users are concentrated in categories one and three (Fig. 3).
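A minimal sketch of the dimensionality reduction used for Fig. 2, based on scikit-learn's TSNE; the random feature matrix and the perplexity value are assumptions made for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(2000, 9)                      # hypothetical user feature matrix for website C
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)                                # (2000, 2) -> scatter-plot coordinates for Fig. 2
```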

Fig. 2. User dimension reduction visualization

Fig. 3. Feature density curves of different categories

Feature density curves are drawn for the four categories of sample users, as shown in Fig. 3. It can be seen that the features of the different categories of users differ:

  • Category 1: random visit users. Characterized by few visits, short visit time, and little page variety.

  • Category 2: monitoring users. They visit specific web pages over a long time span but infrequently.

  • Category 3: normal access users. The categories of web pages visited are scattered, and the visit time is longer than that of category 2; these are normal users browsing web information.

  • Category 4: specific page access users. They visit the website many times in a short period, and the pages visited are targeted.

4.4 Result Analysis

The data of the previous week is used as the training set, and econnoisseur detection is carried out on the latest day's user accesses. Updating and retraining on the training data daily captures the recent overall distribution of users, so the model can be adjusted in near real time. During the Double Eleven period, these e-commerce companies ran promotional activities, which also became a carnival for the econnoisseur. The user access data of November 11 and November 18 were extracted to compare the detection effect during the active period and the non-active period.
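A minimal sketch of the rolling training scheme described above: fit an Isolation Forest on the previous week's user features and score the latest day. The contamination value and the feature matrices are assumptions rather than the paper's actual settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X_train = np.random.rand(20000, 9)     # features of users seen in the previous 7 days
X_today = np.random.rand(3000, 9)      # features of users seen on the latest day

model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
model.fit(X_train)                     # retrained daily on the rolling 7-day window
flags = model.predict(X_today)         # -1 marks suspected econnoisseurs
print(int((flags == -1).sum()), "suspected econnoisseurs today")
```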

Since the detected user data is unlabeled and the sample size is large, 3000 users were randomly sampled from each website and labeled to evaluate the models. As shown in Table 3, the precision and recall of the Isolation Forest model are higher than those of the other two models, indicating better applicability.

Further analysis showed that some econnoisseurs were missed because the set of website URLs they visited differs only slightly from that of normal users. For example, aa/01 and aa/02 are URLs of the same type and can be merged to better distinguish user access patterns. We therefore merge URLs of the same type based on a Bidirectional Long Short-Term Memory (BiLSTM) model to reduce their impact on the URL visit distribution. The resulting detection performance is shown in Table 4.
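The paper does not give the details of the BiLSTM-based URL merging, so the following PyTorch sketch is only one possible interpretation: a character-level BiLSTM embeds each URL, and URLs whose embeddings are close are treated as the same type before visit counts are fed to the Isolation Forest. The architecture, the similarity threshold, and the fact that the encoder would still need to be trained are all assumptions.

```python
import torch
import torch.nn as nn

class URLEncoder(nn.Module):
    """Character-level BiLSTM that maps a URL string to a fixed-size embedding."""
    def __init__(self, vocab_size=128, emb_dim=16, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                 # char_ids: (batch, seq_len)
        h, _ = self.lstm(self.emb(char_ids))     # (batch, seq_len, 2 * hidden)
        return h.mean(dim=1)                     # mean-pool over characters

def encode(urls, model):
    ids = [torch.tensor([min(ord(c), 127) for c in u]) for u in urls]
    batch = nn.utils.rnn.pad_sequence(ids, batch_first=True)
    with torch.no_grad():
        return model(batch)

urls = ["aa/01", "aa/02", "bb/detail"]
emb = encode(urls, URLEncoder())
# URLs whose embeddings are close (e.g. cosine similarity above a threshold)
# would be merged into one URL type before counting visits for the IForest features.
sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(float(sim))
```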

The average number of econnoisseurs detected per day from November 1 to November 11 is taken as the daily average during the activity period, and that from November 12 to November 30 as the daily average during the inactive period. As shown in Fig. 4, the number of econnoisseurs during the activity period is about twice that of the non-activity period.

It can be concluded from the above:

  1. Across the three e-commerce websites and across both active and inactive periods, the Isolation Forest model achieves higher precision and recall than the other two models, giving good results in econnoisseur detection.

  2. After merging URLs of the same type with the BiLSTM model, the recall of the BiLSTM-IForest model for econnoisseurs is significantly higher than that of the plain Isolation Forest.

Detection performs better during the non-activity period than during the activity period. Analysis shows that some users visited discounted products many times but at low frequency; this behavior resembles that of normal users and was not detected. This can be further optimized in later work.

Table 3. Model result comparison
Table 4. Comparison of results before and after the improvement of the Isolation Forest
Fig. 4. Detection amount of econnoisseur users in different periods

5 Conclusion

The rise of e-commerce platforms not only brings convenience to people's lives but also poses greater challenges to the platforms' risk control. How to control and reduce the risk of the platform being exploited by econnoisseurs urgently needs to be addressed. Based on real log data of users visiting websites in the Knownsec Security Intelligence Brain, this paper extracts nine user features to identify the econnoisseur.

By comparison, the unsupervised anomaly detection models show a certain ability to detect e-commerce econnoisseurs. Among them, Isolation Forest achieves higher precision and recall than the LOF and DBSCAN models on the three e-commerce websites and is more suitable for the current e-commerce user data.

Further analysis found that URLs of the same type visited by users can be merged to better describe their real visit behavior. The detection results show that econnoisseur detection based on BiLSTM-IForest is further improved.

The econnoisseurs identified in this paper can have their traffic intercepted in advance. Subsequently, a risk control model can be built on users' actual browsing, purchasing, and other specific behaviors to strengthen real-time prevention and control.

The econnoisseur and the platform are locked in constant competition, each side escalating in response to the other. We need to track the econnoisseur's attack methods in real time and, combined with the risk control platform, further counter the econnoisseur.