Decoupling anomaly discrimination and representation learning: self-supervised learning for anomaly detection on attributed graph

Anomaly detection on attributed graphs is a crucial topic with wide practical applications. Existing methods suffer from semantic mixture and the imbalance issue because they mainly focus on anomaly discrimination while ignoring representation learning. This conflicts with the assortativity assumption, since anomalous nodes commonly connect directly with normal nodes. Additionally, there are far fewer anomalous nodes than normal nodes, indicating a long-tailed data distribution. To address these challenges, a novel algorithm, Decoupled Self-supervised Learning for Anomaly Detection (DSLAD), is proposed in this paper. DSLAD is a self-supervised method that decouples anomaly discrimination from representation learning for anomaly detection. It employs bilinear pooling and a masked autoencoder as anomaly discriminators. By decoupling anomaly discrimination and representation learning, DSLAD constructs a balanced feature space in which nodes are more semantically discriminative and the imbalance issue is resolved. Experiments conducted on six benchmark datasets reveal the effectiveness of DSLAD.


Introduction
Attributed graphs are frequently employed to represent intricate and interconnected data. Recently, anomaly detection on attributed graphs, which seeks to identify minority patterns (such as nodes and edges) that deviate tremendously from the majority on the graph, has attracted a lot of interest [17]. Anomaly detection on attributed graphs can be deployed in many real-world scenarios, such as spotting fraud in transaction networks, spotting incorrect citation relations among academic papers, and spotting users who deliver spam in postal transportation networks.
However, anomaly detection on attributed graphs is quite a challenging task that primarily faces three challenges. First, obtaining enough labels for anomalous nodes is costly. Supervised models are therefore not applicable, since ground-truth labels and the class of anomalies are usually unknown [3]. Second, the neighbors of anomalous nodes are commonly normal nodes. GNN-based algorithms largely rely on aggregating messages from neighbors [32,8,1,25]. As a consequence, anomalous nodes are buried by the messages of normal nodes, leading to a mixture in the semantic space. Third, the number of anomalous nodes is far smaller than that of normal nodes. Traditional deep learning algorithms suffer from the imbalance issue [14,6,36,29]: the majority dominates the embedding training, and the minority is often mistakenly identified as the majority. Therefore, it is urgent to propose an effective self-supervised algorithm that addresses the above three challenges for anomaly detection.
Several methods for anomaly detection on attributed graphs have been proposed. These methods have achieved great success in anomaly discrimination, but they still have some drawbacks. The shallow methods, such as AMEN [20], are limited in expressive capacity. The node-classification-targeted methods, such as DOMINANT [3], simply combine existing node classification models with an anomaly discriminator and are not directly designed for anomaly detection. The anomaly-detection-targeted methods, such as CoLA [15], optimize the model directly for anomaly detection, but they mainly revolve around anomaly discrimination and pay insufficient attention to representation learning, which leads to semantic mixture and the imbalance issue.
To overcome the aforementioned challenges, we propose a novel method, DSLAD, for anomaly detection. In DSLAD, both contrastive learning and generative learning are adopted to discriminate anomalies. Specifically, DSLAD contrasts node-subgraph pairs and measures reconstruction errors to calculate anomaly scores. The anomaly score is further divided into a context anomaly score and a reconstruction anomaly score, computed with bilinear pooling and a masked autoencoder as discriminators, respectively. Considering semantic mixture and the imbalance issue, we introduce contrastive representation learning and decouple it from anomaly discrimination. Through this decoupling, DSLAD maps nodes into a balanced semantic space with little semantic mixture. The contributions of this work can be summarized as follows:
• We integrate contrastive representation learning into the anomaly detection model, which makes nodes more semantically distinguishable and vastly benefits anomaly discrimination.
• We decouple contrastive representation learning and anomaly discrimination, further resolving semantic mixture and the imbalance issue in anomaly detection.
• We conduct a series of experiments on six datasets and the results demonstrate the superiority of DSLAD over the existing models.

Graph Neural Networks
Recently, graph representation learning has achieved considerable success with GNNs. The core idea of GNNs is to aggregate messages from neighbors to update node representations, which is based on the assortativity assumption. GNNs can be divided into two categories: spectral-based methods and spatial-based methods. The former category includes GCN [8], which passes messages by a first-order approximation of the Chebyshev filter. The latter category includes GAT [25] and GraphSAGE [4]. GAT utilizes an attention mechanism, assigning a weight to each edge when aggregating messages. GraphSAGE proposes an inductive representation learning scheme to cope with tasks on large-scale graphs.
Apart from the above fundamental GNN models, many advanced GNN models have been proposed to learn graph representations better. To avoid the sparsity issue and filter out noisy information, [12] proposes a framework preserving low-order proximities, mesoscopic community structure information and attribute information for network embedding. MTSN [16], a dynamic graph neural network, captures local high-order graph signals, learns attribute information based on motifs, and preserves temporal information by temporal shifting. To alleviate the oversmoothing issue, NAIE [2] adopts an adaptive strategy to smooth attribute information and topology information, and develops an autoencoder to enhance the embedding capacity.

Graph self-supervised learning
Self-supervised graph learning, a new learning paradigm that trains models without labels, has been widely used in computer vision [9] and natural language processing [10]. Self-supervised learning on graphs can be categorized into graph contrastive learning (e.g., SimGRACE [30]), graph generative learning (e.g., Graph Completion [35]) and graph predictive learning (e.g., CDRS [39]). Without augmentation, SimGRACE uses a formal encoder and a perturbed encoder to embed the graph, then pulls together the same semantics while pushing apart different semantics across the two hidden spaces. Graph Completion removes the features of the target node and then reconstructs them from the unmasked neighboring nodes. CDRS introduces a pseudo node classification task that collaborates with the clustering task to enhance representation learning.

Anomaly detection on attributed graph
Anomaly detection on attributed graphs aims at identifying patterns that notably diverge from the majority. Many methods have been proposed for anomaly detection, including shallow methods and deep methods. The shallow methods include AMEN [20], Radar [11], and ANOMALOUS [19]. AMEN measures the correlation of features between the target node and its ego-network to detect anomalies. Radar analyzes the residuals of attribute information and its coherence with graph information. ANOMALOUS integrates CUR decomposition and residual analysis. The shallow methods are limited by their expressive ability in graph embedding. The deep methods can be further divided into two classes. The first class includes DOMINANT [3] and DGIAD [26,15]. DOMINANT reconstructs the adjacency matrix and the attribute matrix, then distinguishes anomalies through the reconstruction error. DGI contrasts nodes and the whole graph for embedding; deployed with a trained discriminator, DGI can be used for anomaly detection, and we rename this method DGIAD in this paper. The first-class deep methods, node classification models merely equipped with an anomaly discriminator, are not devised for anomaly detection and still do not show satisfying performance. The second class includes CoLA [15], SL-GAD [37], and ANEMONE [7], which contrast nodes and subgraphs to discriminate anomalies. The second-class deep methods optimize the model toward anomaly detection, but they neglect representation learning and thus fail to overcome semantic mixture and the imbalance issue.

Problem definition
In this section, the problem definition of anomaly detection on attributed graphs is introduced. Given an attributed graph G = (V, E, X), the target of anomaly detection is to learn a mapping mechanism f(⋅) to calculate the anomaly score y_i for each node v_i ∈ V. The anomaly score describes the degree of abnormality of node v_i. Anomalies are easy to detect if the mapping mechanism f(⋅) is well designed and outputs accurate anomaly scores. For convenience, all important notations are explained in Table 1.

Method
In this section, a thorough introduction to DSLAD is given. As shown in Figure 1, DSLAD consists of four modules: discrimination pair sampling, GNN-based embedding, anomaly discrimination and contrastive representation learning. On attributed graphs, contrastive learning at the node-subgraph level is powerful for graph representation learning [31,13], and it has been discovered that detecting anomalies at the node-subgraph level is effective [15]. To detect anomalies, we sample discrimination pairs at the node-subgraph level. The target nodes and their sampled subgraphs are then embedded into low-dimensional vectors via GNNs. Next, the embedding vectors of target nodes and their sampled subgraphs are fed into anomaly discrimination and contrastive representation learning. Contrastive representation learning lightens the semantic mixture and imbalance issue, and decoupling it from anomaly discrimination alleviates them further.

Discrimination pair sampling.
The key to anomaly detection is finding the patterns significantly different from the majority. Therefore, discrimination pairs are crucial to this task. Graph objects can be categorized into edges, nodes, subgraphs and graphs. Any two of them, excluding edges, can be selected to constitute discrimination pairs. We sample discrimination pairs at the node-subgraph level. The procedure is as follows:
• Target node selection. A set of nodes is randomly selected from the input graph every epoch without replacement so that each node has the same chance of being chosen.
• Subgraph sampling. For every selected target node, a neighboring subgraph is sampled via random walk with restart (RWR) [24] as augmentation, avoiding the introduction of extra anomalies. Other sampling methods can also be considered. The size of the neighboring subgraph is fixed to K, which determines the matching scope of the target node.
• Attribute mask. The attributes of the target node are masked with zero vectors in the sampled subgraph, making it more difficult to identify the information of the target node in the subgraph. This mechanism improves the ability of anomaly detection [15,37].
Target nodes and neighboring subgraphs are combined as discrimination pairs for anomaly discrimination.A positive pair includes a node and a subgraph sampled from it, while a negative pair includes a node and a subgraph sampled from other nodes.
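The sampling procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the graph is a plain adjacency dict, and the restart probability and step cap are assumed values.

```python
import random

def rwr_subgraph(adj, target, size_k, restart_p=0.5, max_steps=1000, seed=0):
    """Sample a neighboring subgraph of `target` via random walk with restart.

    adj: dict mapping node id -> list of neighbor node ids.
    Returns a list of at most `size_k` node ids; the target node comes first.
    """
    rng = random.Random(seed)
    visited = [target]          # the target node is always included
    current = target
    for _ in range(max_steps):
        if len(visited) >= size_k:
            break
        if rng.random() < restart_p or not adj.get(current):
            current = target    # restart the walk at the target node
        else:
            current = rng.choice(adj[current])
        if current not in visited:
            visited.append(current)
    return visited
```

A usage example would mask the attributes of `visited[0]` (the target node) in the returned subgraph, as described in the attribute-mask step.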

GNN-based embedding
For anomaly discrimination and contrastive representation learning, obtained target nodes and their neighboring subgraphs are mapped into low-dimensional embedding space by GNNs.
We apply a GCN encoder and a GCN autoencoder to embed the graph and reconstruct the attribute matrix, respectively.
A target node v_i is embedded as a graph with only one node, so the GNN propagation formula simplifies to an MLP, and e_i ∈ ℝ^d is used to denote the output of the GCN encoder, which is the node-level representation vector of v_i. For the K-node subgraph G_i sampled from node v_i, the adjacency matrix is denoted by A_{G_i} ∈ ℝ^{K×K} and the attribute matrix by X_{G_i} ∈ ℝ^{K×d^(0)}. Then, the GNN operator is applied:

H^(l)_{G_i} = φ(D̃_{G_i}^{-1/2} Ã_{G_i} D̃_{G_i}^{-1/2} H^(l-1)_{G_i} W^(l)),

where Ã_{G_i} = A_{G_i} + I and H^(0)_{G_i} = X_{G_i}. The output of the GCN encoder is denoted by E_i ∈ ℝ^{K×d}, which is the context representation matrix of subgraph G_i, and the output of the GCN autoencoder on G_i is denoted by X̂_{G_i} ∈ ℝ^{K×d^(0)}, which is the reconstructed attribute matrix of G_i.
The readout module summarizes E_i into its subgraph-level representation g_i ∈ ℝ^d. We take average pooling as the readout module, so the subgraph-level representation can be formulated as:

g_i = (1/K) Σ_{k=1}^{K} E_i[k, :],

where k indexes the nodes in the neighboring subgraph G_i.
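A minimal sketch of one GCN propagation step and the average-pooling readout, using dense NumPy arrays for clarity. The actual model stacks such layers with learned weights; the ReLU activation here is an assumption for the hidden layers.

```python
import numpy as np

def gcn_layer(adj, h, w):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).

    adj: (K, K) adjacency matrix; h: (K, d_in) features; w: (d_in, d_out) weights.
    """
    a_tilde = adj + np.eye(adj.shape[0])                      # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_tilde.sum(axis=1)))  # D̃^{-1/2}
    return np.maximum(d_inv_sqrt @ a_tilde @ d_inv_sqrt @ h @ w, 0.0)

def readout(h_sub):
    """Average pooling over node representations -> subgraph-level vector g."""
    return h_sub.mean(axis=0)
```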

Anomaly discrimination
In this subsection, we describe how DSLAD discriminates anomalies. Anomaly discrimination comprises two components, context anomaly and reconstruction anomaly, detailed below.

Context anomaly
Anomalies differ significantly from the majority, so anomalous nodes are supposed to be far away from normal nodes in the embedding space. To assess how well a discrimination pair matches, we take bilinear pooling as the discriminator.
Given a node v_i and the relevant neighboring subgraph G_i, the context anomaly score of this discrimination pair can be calculated as:

s_i = φ(e_i W_b g_i^T),

where W_b ∈ ℝ^{d×d} is a learnable weight matrix and φ(⋅) is a non-linear, non-negative activation function. Here we use Sigmoid as the activation function.
We use graph contrastive learning at the node-subgraph level, taking both positive and negative discrimination pairs into account. For target node v_i, we take P positive discrimination pairs and Q negative discrimination pairs to compute the context anomaly score. The positive score s_i^(+) and the negative score s_i^(−) are formulated as:

s_i^(+) = (1/P) Σ_{G ∈ G_i^(+)} φ(e_i W_b g^T),    s_i^(−) = (1/Q) Σ_{G ∈ G_i^(−)} φ(e_i W_b g^T),

where G_i^(+) and G_i^(−) denote the positive neighboring subgraph set and the negative neighboring subgraph set for target node v_i, respectively. For simplicity, we set P = Q = 1. In this part, our optimization goal is maximizing the agreement between the context anomaly score and the ground-truth label (label 1 for positive pairs and 0 for negative pairs). The loss function of the context anomaly score is a binary cross-entropy:

L_con = −(1/2n) Σ_{i=1}^{n} [log(s_i^(+)) + log(1 − s_i^(−))].
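The bilinear discriminator and the cross-entropy objective described above can be illustrated as follows. `context_score` and `context_loss` are hypothetical helper names; the loss simply pushes positive-pair scores toward 1 and negative-pair scores toward 0.

```python
import numpy as np

def context_score(e_node, g_sub, w):
    """Bilinear discriminator: s = Sigmoid(e^T W g)."""
    return 1.0 / (1.0 + np.exp(-(e_node @ w @ g_sub)))

def context_loss(pos_scores, neg_scores, eps=1e-12):
    """BCE: positive pairs are pulled toward label 1, negative pairs toward 0."""
    pos = -np.log(np.asarray(pos_scores) + eps)
    neg = -np.log(1.0 - np.asarray(neg_scores) + eps)
    return float(np.concatenate([pos, neg]).mean())
```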

Reconstruction anomaly
Inspired by [37,35], we introduce the reconstruction error as a supplementary mechanism for anomaly discrimination. For target node v_i, its attributes have been removed on the neighboring subgraph G_i. DSLAD tries to reconstruct the attributes of the target node v_i from the other nodes on G_i.
The ℓ2-norm is adopted to quantitatively measure the distance between the original information and the reconstructed information. Let k be the index of v_i in the neighboring subgraph G_i, so the reconstructed attribute vector of v_i is X̂_{G_i}[k, :]. To train the masked autoencoder for reconstruction anomaly, we adopt the MSE as the loss function for this portion:

L_rec = (1/n) Σ_{i=1}^{n} ‖x_i − X̂_{G_i}[k, :]‖₂²,

where x_i ∈ ℝ^{d^(0)} is the original attribute vector of node v_i.
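A small sketch of the reconstruction side, assuming plain NumPy arrays: `reconstruction_score` computes the ℓ2 distance used for scoring, while `reconstruction_loss` is the MSE training objective. Both names are illustrative.

```python
import numpy as np

def reconstruction_score(x_orig, x_rec):
    """L2 distance between original and reconstructed attribute vectors."""
    return float(np.linalg.norm(np.asarray(x_orig) - np.asarray(x_rec)))

def reconstruction_loss(x_batch, x_rec_batch):
    """MSE loss averaged over a batch of target nodes."""
    diff = np.asarray(x_batch) - np.asarray(x_rec_batch)
    return float((diff ** 2).mean())
```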

Contrastive representation learning
In the context anomaly module, the loss function in equation (7) mainly focuses on anomaly discrimination and pays less attention to representation learning. Additionally, it assumes that all nodes contribute equally, leading to semantic mixture and the imbalance issue. To prevent the normal nodes from dominating representation learning, we implement the contrastive representation learning module and set the numbers of positive and negative samples equal.
For target node v_i, we select a neighboring subgraph G_j (j ≠ i) sampled from node v_j as the negative sample, in consistency with the context anomaly module. As for the positive sample, the neighboring subgraph G_i, augmented by the masked-subgraph strategy, is selectable. Another augmentation strategy embeds the whole graph without the mask; the target node augmented by this whole-graph strategy can also be the positive sample. Let z_i^− denote the representation vector of the negative sample, z_i^− = g_j (j ≠ i), and z_i^+ denote the representation vector of the positive sample, where g_j ∈ ℝ^d is computed by equation (3) with the GCN encoder of the context anomaly module. Both the number of positive samples and the number of negative samples are set to 1 for simplicity and fairness. We adopt InfoNCE [18] as the loss function of contrastive representation learning:

L_cl = −(1/n) Σ_{i=1}^{n} log [ exp(sim(e_i, z_i^+)/τ) / (exp(sim(e_i, z_i^+)/τ) + exp(sim(e_i, z_i^−)/τ)) ],   (10)

where τ is a temperature parameter greater than 0.
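With one positive and one negative sample per target node, the InfoNCE loss reduces to a two-term softmax over similarities. A sketch using cosine similarity; the temperature value is illustrative.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(z, z_pos, z_neg, tau=0.5):
    """InfoNCE with one positive and one negative sample per target node."""
    pos = np.exp(cosine(z, z_pos) / tau)
    neg = np.exp(cosine(z, z_neg) / tau)
    return float(-np.log(pos / (pos + neg)))
```

The loss is small when the target representation is close to its positive sample and far from its negative sample, which is exactly the pull-close/push-away behavior described above.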

Decoupling
In this subsection, we explain why and how we decouple anomaly discrimination and contrastive representation learning. When behaviors and label semantics are excessively inconsistent in anomaly detection tasks, [28] has shown that jointly training graph representation learning and anomaly discrimination may lead to performance degradation. Moreover, the class imbalance problem can also be significantly relieved by decoupling representation learning and anomaly discrimination. At the beginning of training, discriminators are prone to predict arbitrarily and produce erroneous results, while contrastive representation learning forms a balanced semantic space [33,5,34]. As training proceeds, discriminators make increasingly accurate predictions while the gain from contrastive representation learning decays [38,27]. Gradually shifting from contrastive learning to anomaly discrimination therefore enhances effectiveness. Based on the above analysis, instead of training jointly with a single classification loss, we decouple anomaly discrimination and contrastive representation learning and give them dynamic weights.
Let t denote the ratio of the current epoch to the total number of training epochs, whose value indicates training progress. f(t) is the factor balancing the anomaly discrimination loss and the contrastive representation learning loss, where f(⋅) is a mapping function. The final loss function can be written as:

L = f(t)(L_con + αL_rec) + (1 − f(t))βL_cl,   (11)

where α and β are hyperparameters that control the role of the different anomaly scores and scale the contrastive representation learning loss, respectively, and f(t) increases with t.
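A sketch of the decoupled objective with the linear schedule f(t) = t. The exact way α and β enter the combination is our reading of the description above, so treat the formula inside `decoupled_loss` as an assumption; the names are illustrative.

```python
def decoupled_loss(l_con, l_rec, l_cl, t, alpha=0.6, beta=1.0, f=lambda x: x):
    """Combine the discrimination losses and the contrastive loss.

    t: current_epoch / total_epochs in [0, 1]; f(t) must increase with t,
    shifting weight from representation learning to anomaly discrimination.
    """
    return f(t) * (l_con + alpha * l_rec) + (1.0 - f(t)) * beta * l_cl
```

At t = 0 the objective is purely contrastive representation learning; at t = 1 it is purely anomaly discrimination, matching the gradual shift motivated above.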

Anomaly score calculation
We could calculate the final anomaly score for each node after training.
For node v_i, the context anomaly score and the reconstruction anomaly score can be inferred as follows:

s_i^(con) = s_i^(−) − s_i^(+),
s_i^(rec) = ‖x_i − X̂_{G_i}[k, :]‖₂,

where s_i^(−) and s_i^(+) are calculated by equations (5) and (6) respectively, the index of v_i in the neighboring subgraph G_i is k, X̂_{G_i}[k, :] is the reconstructed attribute vector of v_i, and x_i is the original attribute vector of node v_i. By MinMaxScaler, we transform the context anomaly score to [0, 1] for standardization:

s̃_i^(con) = (s_i^(con) − s_min^(con)) / (s_max^(con) − s_min^(con)),

where s_min^(con) and s_max^(con) are the minimum and maximum of the context anomaly scores, respectively. Similarly, the reconstruction anomaly score is also transformed to [0, 1] by MinMaxScaler:

s̃_i^(rec) = (s_i^(rec) − s_min^(rec)) / (s_max^(rec) − s_min^(rec)),

where s_min^(rec) and s_max^(rec) are the minimum and maximum of the reconstruction anomaly scores, respectively.
Combining the transformed context anomaly score and reconstruction anomaly score, we get the final anomaly score of node v_i:

s_i = α s̃_i^(con) + (1 − α) s̃_i^(rec).

Since neighboring subgraphs are sampled stochastically, we take the average anomaly score over R rounds as the final anomaly score to reduce the sampling variance.
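The inference-time score combination can be sketched as follows. The weight α reused for mixing the two normalized scores is an assumption based on the hyperparameter discussion, and averaging over the R sampling rounds would wrap these calls in a loop.

```python
def minmax(scores):
    """Min-max scaling of a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def final_scores(con_scores, rec_scores, alpha=0.6):
    """Combine normalized context and reconstruction anomaly scores.

    Mixing with `alpha` is an assumed form of the combination step.
    """
    con, rec = minmax(con_scores), minmax(rec_scores)
    return [alpha * c + (1.0 - alpha) * r for c, r in zip(con, rec)]
```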

Experiments
In this section, a series of experiments is carried out on six real-world datasets to examine the effectiveness of our model.

Datasets
Six frequently used real-world datasets for anomaly detection, including four citation network datasets and two social network datasets, are applied to evaluate our model.
The following is a brief overview of the six datasets:
• Citation network datasets. Cora, Citeseer, Pubmed [21] and ACM [22] are four public citation network datasets composed of scientific publications. In the four citation networks, published papers are represented as nodes while edges represent citation relationships between papers, and the description text of the papers is transformed into node features.
• Social network datasets. BlogCatalog and Flickr [23] are acquired from websites for sharing blogs and images, respectively. In the two datasets, each user is represented by a node, and links among nodes illustrate the relationships between the corresponding users. Users often describe themselves with personalized information, such as posted blogs and public photos, from which features can be extracted.
Since there are no ground-truth anomaly labels in the above six real-world datasets, injecting synthetic anomalous nodes into the datasets to simulate real anomalies is widely used. We follow the perturbation processing in [15,3] to inject both attribute anomalies and structure anomalies into the six datasets. For attribute anomaly injection, we select nodes and replace their features with those of randomly selected remote nodes. For structure anomaly injection, we pick nodes and divide them evenly into clusters; nodes within the same cluster are connected to each other. The statistics of these contaminated datasets are depicted in Table 2.

Baselines
We choose some state-of-the-art methods as baselines to compare with our proposed DSLAD on the above six real-world datasets. These methods are divided into three categories:
(1) The shallow methods: the shallow methods detect anomalies without deep learning. We pick the following three models for comparison:
• AMEN [20] compares the correlation of features between target nodes and their ego-networks and identifies the nodes with low scores as anomalies.
• Radar [11] analyzes the residuals of attribute information and its coherence with graph information to detect the abnormal nodes as anomalies.
• ANOMALOUS [19] utilizes CUR decomposition and residual analysis to distinguish the irregular nodes as anomalies.
(2) The node-classification-targeted methods: the node-classification-targeted methods simply extend a node classification model with an anomaly detection module. We choose the following two models for comparison:
• DOMINANT [3] learns node embeddings by autoencoders and takes the reconstruction errors as the anomaly scores.
• DGIAD [26,15] uses DGI to learn node embeddings and takes bilinear pooling to compute the anomaly scores.
(3) The anomaly-detection-targeted methods: the anomaly-detection-targeted methods are designed to detect anomalies directly, even without considering node classification. We select the following three models for comparison:
• CoLA [15] learns node embeddings by GCN and contrasts nodes and subgraphs to discriminate anomalies.
• SL-GAD [37] contrasts nodes and subgraphs and additionally measures reconstruction errors to discriminate anomalies.
• ANEMONE [7] contrasts nodes and subgraphs to discriminate anomalies.

Evaluation metrics
We utilize ROC-AUC, a widely used metric for anomaly detection, to quantify the performance of DSLAD and the baselines. The ROC curve is depicted by the true positive rate (y-axis) against the false positive rate (x-axis). AUC is the area enclosed by the ROC curve and the x-axis, and always falls between 0 and 1. A higher AUC indicates better performance.
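ROC-AUC can be computed without any library via its rank interpretation: the probability that a randomly chosen anomalous node receives a higher score than a randomly chosen normal node. A small sketch:

```python
def roc_auc(labels, scores):
    """AUC via the Mann-Whitney U statistic.

    labels: 1 for anomalous nodes, 0 for normal nodes.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice a library routine such as scikit-learn's `roc_auc_score` computes the same quantity more efficiently.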

Experiments setting
We set the neighboring subgraph size K to integers within [2, 10]. The number of layers of the GNN encoders and decoders is set to 3 on Flickr and 1 on the other five datasets. The hidden dimension is set to 64 and the number of test rounds R = 256.

Comparison Results
To verify the effectiveness of our model in the anomaly detection task, we conducted comparison experiments for all baselines and DSLAD with the AUC metric on six benchmark datasets; the results are shown in Table 3. Based on the results, we can make the following observations:
• Compared with the most advanced baselines, our method outperforms them on all benchmark datasets by a large margin, improving by 0.72% at least, 8.35% at most and 2.59% on average, which reveals the effectiveness of our method.
• The shallow methods AMEN, Radar, and ANOMALOUS perform worse than the other baselines because of their limited expressive capacity.
• The node-classification-targeted methods DOMINANT and DGIAD perform better than the shallow methods. DOMINANT reconstructs the attribute matrix and the adjacency matrix, not directly targeting anomaly detection. DGIAD contrasts nodes and the whole graph, utilizing very little local information.
• The anomaly-detection-targeted methods CoLA, ANEMONE and SL-GAD go a step further. However, they mainly concentrate on training the anomaly discriminator and are still constrained by semantic mixture and the imbalance issue.

Augmentation strategy
In this subsection, we assess the effectiveness of the augmentation strategy in our method. As illustrated in Figure 2 (a), our model is not sensitive to the augmentation strategy on Cora, and performs better with the masked-subgraph strategy than with the whole-graph strategy on the other datasets except Flickr. The main reason may be that Flickr has the most complex attribute information, which matters more than structure information there, while on the other five datasets structure information has a greater impact than attribute information.
Based on the above observations, we choose the whole-graph strategy as the augmentation strategy on Flickr, and the masked-subgraph strategy on the remaining five datasets.

f(t) strategy
In this subsection, we investigate how different mapping functions f(t) affect our model. A constant (e.g., 0.5), linear growth (e.g., f(t) = t), and activation-style functions (e.g., 1 − exp(−t), Sigmoid(t) and tanh(t)) are taken into consideration. As demonstrated in Figure 2 (b), the best results are acquired when setting f(t) = t, although our model is not sensitive to f(t) on Pubmed and Flickr. The linear f(t) also generalizes best.

Ablation studies
In this subsection, we conduct ablation studies to better understand the effectiveness of each component in our method. Three variants are defined as:
• DSLAD w/o cl: remove contrastive representation learning and set f(t) = 1.
• DSLAD w/o con: remove the context anomaly score.
• DSLAD w/o rec: remove the reconstruction anomaly score.
As shown in Table 5, DSLAD outperforms the other variants, indicating that all components play an important role in our method and promote each other. We demonstrate the visualization of the embeddings on Cora in Figure 3, where both the normal nodes and the anomalous nodes learned by DSLAD are denser than those learned by CoLA, and DSLAD even works for node classification. Removing contrastive representation learning causes remarkable performance degradation; obviously, contrastive representation learning promotes anomaly discrimination a lot by reducing class imbalance and lightening semantic mixture. Additionally, removing either the context score or the reconstruction score results in performance degradation, with the former being more noticeable. This demonstrates that the context score and the reconstruction score complement each other, while the context score is more effective for anomaly detection.

Parameter sensitivity
In this subsection, a series of experiments are conducted to study the effect of hyperparameters.

Subgraph size
DSLAD is executed on the six benchmark datasets with subgraph size K within [2, 10]. As seen in Figure 4, AUC grows to a peak and then drops as the subgraph size increases. DSLAD achieves the AUC peak at K = 5 on Citeseer and Flickr, at K = 7 on Pubmed, and at K = 4 on the rest of the datasets. These results show that a too-small subgraph contains insufficient information, restricting anomaly detection, while a too-large subgraph contains redundant information, which hurts our model; a suitable subgraph size guarantees the best performance of DSLAD.

Effect of hyperparameters α and β
Besides the subgraph size K, we also discuss the hyperparameters α and β.
To explore the effect of hyperparameter α, we select its value from {0.2, 0.4, 0.6, 0.8}. As illustrated in Figure 5, DSLAD performs best with α = 0.6 on Citeseer, BlogCatalog and Flickr, and with α = 0.8 on Cora, Pubmed and ACM. This shows that on every dataset the context anomaly score has a greater impact than the reconstruction anomaly score, supporting the earlier observation that the context score matters more.

Conclusion
In this paper, a novel framework called DSLAD is proposed for graph anomaly detection. DSLAD is composed of four modules: discrimination pair sampling, GNN-based embedding, anomaly discrimination and contrastive representation learning. Both contrastive learning and generative learning are employed to discriminate anomalies; they complement one another and improve effectiveness. Contrastive representation learning greatly alleviates the semantic mixture and imbalance problems, generates a more balanced semantic space and facilitates node embedding.
By decoupling anomaly discrimination and contrastive representation learning, the performance of DSLAD is further improved. In the future, we will explore a unified representation learning framework for anomaly detection and node classification.

Figure 1 :
Figure 1: Framework of DSLAD. There are four components in DSLAD: discrimination pair sampling, GNN-based embedding, anomaly discrimination and contrastive representation learning. DSLAD first selects a set of target nodes and samples their neighboring subgraphs. Next, the nodes and the sampled subgraphs are embedded into low-dimensional vectors by GNNs for anomaly detection and contrastive representation learning. Finally, the discriminators measure the distance between node-subgraph pairs to discriminate anomalies, while target nodes are pulled close to positive samples and pushed away from negative samples (during training only). Notably, anomaly discrimination and representation learning are decoupled.

Figure 2 :
Figure 2: Performance comparison between different positive augmentation strategies and mapping functions f(⋅).

Figure 5 :
Figure 5: Parameter sensitivity studies for hyperparameters α and β. The AUC values are mapped to colors with the viridis colormap.
The research is supported by the Key-Area Research and Development Program of Guangdong Province (2020B010165003, 2020B0101090005), the National Natural Science Foundation of China (62176269), the Guangdong Basic and Applied Basic Research Foundation (2019A1515011043), the Innovative Research Foundation of Ship General Performance (25622112), and the National Natural Science Foundation of China and Guangdong Provincial Joint Fund (U1911202).
CRediT authorship contribution statement
YanMing Hu: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing - Original Draft, Visualization. Chuan Chen: Conceptualization, Methodology, Writing - Original Draft, Supervision. BoWen Deng: Validation, Writing - Original Draft. YuJing Lai: Writing - Original Draft. Hao Lin: Resources, Project administration. ZiBin Zheng: Resources, Project administration. Jing Bian: Supervision.

Table 1
Statements of important notations

Table 2
The statistics of datasets.

Table 3
Comparison experiment results of anomaly detection by the AUC metric on six benchmark datasets. The best and second-best methods are marked in bold and underlined fonts, respectively. P-value = 0.0331 (popmean = mean + std); the std can be seen in Table 4.

Table 5
Ablation studies on six benchmark datasets, exploring how each module affects the whole model. Variants DSLAD w/o cl, DSLAD w/o con and DSLAD w/o rec are generated by removing contrastive representation learning and setting f(⋅) = 1, removing the context score, and removing the reconstruction score, respectively.