1 Introduction

With the ongoing adjustment of the energy structure, natural gas, as a clean energy source, accounts for a rapidly growing share of energy consumption. During the implementation of the coal-to-gas policy, more and more boiler room users supply heating for residents by consuming natural gas in winter [22]. However, some users steal gas by illegally refitting equipment to reduce the metered gas volume and thus pay lower gas fees [13]. Gas theft causes huge economic losses and may lead to gas leakage and even explosions, resulting in a series of safety hazards and social problems. To catch gas thieves, gas companies carry out indiscriminate manual on-site inspections. Such a random approach is inefficient and slow because there are no specific suspects. Hence, it is crucial to discover gas-theft suspects automatically, which narrows the suspicious range, improves verification efficiency, and enables timely intervention to avoid accidents.

Fortunately, we can remotely collect gas consumption data through smart gas meters. With these sensed data, we can analyze gas usage patterns and build a data-driven approach to discover gas-theft suspects. However, there are two challenges. (1) Diversity of user behavior: the means of stealing gas are diverse, and gas usage patterns differ across users; as a result, the behavior fluctuations of legitimate users may also appear abnormal. (2) Weak labeling: labeled data cover only a small part of the limited known gas thefts, while the large body of unlabeled data contains a very large number of normal users together with the remaining gas thefts. We cannot synthesize additional gas-theft labels from the limited labels, because gas theft can only be confirmed through on-site inspection. Besides, it is difficult to set criteria for gas theft in terms of degree and anomaly pattern.

Typically, this problem can be regarded as a time series anomaly detection task focusing on pattern-wise anomalies [3]. However, when the abnormal proportion is very small, directly applying off-the-shelf classifiers may yield biased results. Furthermore, the absence of confirmed normal samples leads to considerable uncertainty when dealing with new anomalies [6]. Recently, weakly supervised anomaly detection, especially under the positive-unlabeled learning framework, has received extensive attention, as it exploits limited labeled anomalies together with large amounts of unlabeled data [15]. The process has two parts [8]: an initial model selects normal-sample candidate sets from the unlabeled samples [17], and a modified model is then trained on the newly labeled data to identify the remaining anomalies [28].

For gas scenarios, two studies have put forward solutions to detect gas-theft suspects. Yi et al. obtained abnormal users through the negative correlation between temperature and gas consumption, and then used a One-Class Support Vector Machine (OCSVM) to divide abnormal users into theft suspects and irregular users [27]. Yang et al. assessed the stability of gas usage patterns to obtain normal users, and then combined normal users with gas-theft labels into positive-negative sample pairs to identify theft suspects through RankNet [26]. Nevertheless, the two studies have some limitations. Firstly, they rely on simple, limited statistical indicators, which amount to one-size-fits-all thresholds. Besides, features are extracted manually, which may fail to capture complex gas usage patterns. Furthermore, they make poor use of the small amount of labeled abnormal data and the large amount of unlabeled data. These operations make the whole process cumbersome and error-prone.

In response to these challenges and drawbacks, we propose a neural clustering and ranking approach to detect gas-theft suspects among boiler room users. Our approach contains two modules: (1) normal user identification, which exploits the regular behavior of the majority of normal users to separate normal users from unstable users whose behavior appears abnormal, by integrating representation learning and clustering; (2) suspicious user detection, which deeply mines the correlations between different users and discovers gas-theft suspects among unstable users through the anomaly scores of triplet ranking. The two modules are seamlessly connected by combining clustering and ranking neural networks, which learn gas consumption patterns in depth and overcome the problem of label scarcity. Our contributions are fourfold:

  • Under the positive-unlabeled learning framework, we propose a neural clustering and ranking approach for gas-theft suspect detection, which narrows the suspicious range to increase the efficiency of inspection workforces.

  • Considering the regular behavior of users, we propose a joint clustering module that obtains reliable normal users by jointly optimizing representation and clustering, which learns pseudo-normal labels and narrows the scope of potentially abnormal users.

  • Considering the behavior correlations among users, we propose a triplet ranking module that detects suspects by learning the closeness and deviation relations within constructed triplets, which improves data utilization.

  • Extensive experiments on three real-world datasets show that our approach has obvious advantages over baselines in reducing the false-positive rate.

2 Overview

2.1 Task Definition

Given the daily gas consumption records \(X = \left\{ x_1, x_2,..., x_n\right\}\) of n boiler room users, we aim to detect which users among the user set U exhibit gas theft behaviors. Here, \(K \subset U\) is a very small set of labeled abnormal users, with \(|K| = k\) and \(k \ll n\).

Fig. 1 Framework of the proposed neural clustering and ranking approach, consisting of two modules: joint clustering for normal user identification and triplet ranking for suspicious user detection. The input of the model is gas consumption records together with gas-theft labels, and the output is gas-theft suspects

Figure 1 illustrates the framework of the proposed neural clustering and ranking approach, consisting of two modules: joint clustering for normal user identification and triplet ranking for suspicious user detection. Firstly, we use a variational autoencoder to learn hidden representations of the gas consumption records. Then, seeking representations with small intra-class and large inter-class distances, we cluster the representations into groups to distinguish normal from unstable users. Here, joint clustering synchronously optimizes the cluster label allocation and fine-tunes the encoder network. Based on the learned representations, we take the identified normal samples and the labeled abnormal samples as candidate sets, and then construct triplets. After that, we train the anomaly-scoring network on the triplets to generate an anomaly score for each given user. If the anomaly score is higher than a threshold, the user is regarded as a suspect. In this way, the two modules are connected to overcome the label scarcity problem and achieve better detection accuracy.

3 Normal User Identification

Most of the collected gas users are unlabeled, while only a small part are labeled abnormal users. Using unlabeled users as normal samples indiscriminately may have negative effects, because some of them behave abnormally. Besides, due to the diversity of abnormal behavior, it is difficult to detect new types of abnormal behavior from the limited labeled abnormal data. Therefore, our goal is to find reliable normal users and set apart unstable users. Thus, we not only reduce the scope of suspects, lowering the overall complexity, but also provide negative samples (normal users) for the subsequent detection module.

Considering that normal users account for the vast majority of all users, and that most normal behaviors exhibit certain regularity, we learn representations with small intra-class and large inter-class distances to cluster users into groups. Figure 2 shows the framework of the joint clustering network. Specifically, we use a variational autoencoder for pre-training, and feed the extracted gas feature representations to K-means for cluster initialization. After that, joint clustering synchronously optimizes the cluster label allocation and fine-tunes the network. Thus, all users can be classified into normal users and unstable users.

Fig. 2 Joint clustering for normal user identification. The model uses a variational autoencoder for pre-training and then uses clustering for fine-tuning

3.1 Variational Autoencoder

The variational autoencoder (VAE) is widely used for dimensionality reduction to learn robust hidden representations. Generally, it compresses input data into a low-dimensional latent space and then reconstructs the data, making the input and output as close as possible. The VAE consists of two parts: an encoder \(f(z|x_i)\) and a decoder \(h(x_i|z)\), where \(x_i\) denotes the gas consumption time series of user i and z denotes the hidden representation. The loss function of the VAE is the negative log-likelihood with a regularization term, summed over all data points:

$$\begin{aligned} L_v=\sum _{i=1}^{n}\left( -E_{f(z|x_i)}\left[ \log h(x_i|z)\right] +KL\left( f(z|x_i)\,\Vert \, h(z)\right) \right) \end{aligned}$$
(1)

where n is the number of gas users. The first term is the reconstruction loss. The second term measures how closely the posterior distribution \(f(z|x_i)\) approximates the prior distribution h(z); it encourages \(f(z|x_i)\) to approach h(z) and ensures that the latent space is regularized.
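To make Eq. (1) concrete, the following minimal PyTorch sketch implements a VAE with the layer sizes reported in Sect. 5.4 (sequence length 121, hidden size 64, latent dimension 16). It is an illustrative reading of the model rather than the authors' code, and the mean-squared reconstruction term is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    # Encoder f(z|x) and decoder h(x|z); sizes follow Sect. 5.4.
    def __init__(self, in_dim=121, hid_dim=64, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.mu = nn.Linear(hid_dim, z_dim)        # mean of f(z|x)
        self.logvar = nn.Linear(hid_dim, z_dim)    # log-variance of f(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    # Eq. (1): reconstruction term plus KL(f(z|x) || N(0, I)), summed over users.
    rec = F.mse_loss(x_rec, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl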

3.2 Clustering

In such an almost unsupervised setting, we use clustering to divide user groups based on the learned representations. Here, we use K-means to obtain the initial cluster centers, where the number of clusters is selected according to the experimental results. Then, we use a Kullback-Leibler (KL) divergence loss [7] to gradually match the model to a suitably shaped distribution, slowly updating the cluster centers and data representations.

We first calculate soft labels between the embedded points and the cluster centroids. Then, the cluster centroids are refined using an auxiliary target distribution that learns from high-confidence assignments. We repeat this process until the convergence criteria are met. By minimizing the KL-divergence alignment loss, samples close to a cluster center are pulled closer, making the data easier to separate in the representation space. Thus, the cluster loss \(L_c\) is computed as the KL divergence between the model distribution Q and the target distribution P, yielding a more cluster-friendly latent representation:

$$\begin{aligned} L_c=KL(P||Q)=\sum _i \sum _j p_{ij}log \frac{p_{ij}}{ q_{ij}} \end{aligned}$$
(2)

where \(q_{ij}\) measures the similarity between embedded sample \(z_i\) and cluster j [11], and \(p_{ij}\) is the target distribution derived from \(q_{ij}\), which emphasizes high-confidence assignments of samples to cluster j.
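The paper does not spell out the exact forms of \(q_{ij}\) and \(p_{ij}\); a standard DEC-style choice consistent with [7, 11] is sketched below (a Student's t kernel for Q and a sharpened, frequency-normalized target for P), which should be read as one plausible instantiation.

import torch

def soft_assign(z, centers, alpha=1.0):
    # q_ij: Student's t similarity between embeddings z (n, d) and centers (k, d).
    dist2 = torch.cdist(z, centers) ** 2
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # p_ij: square q and normalize by cluster frequency to favor confident assignments.
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

def cluster_loss(p, q, eps=1e-8):
    # Eq. (2): KL(P || Q).
    return torch.sum(p * torch.log((p + eps) / (q + eps)))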

Existing deep joint clustering methods combine a neural network with clustering [12, 25]. However, during clustering, minimizing the clustering loss alone may produce a degenerate solution: the neural network maps all samples x to the same point, so the loss is 0 but all samples fall into one class. Therefore, we add an additional constraint to eliminate the degenerate solution. Here, we explicitly constrain the samples to be distributed across the two categories:

$$\begin{aligned} L_m=max \left\{ 0,\frac{m_{y=1}}{m_y} -b \right\} \end{aligned}$$
(3)

where \(m_y\) is the number of samples being classified, \(m_{y=1}\) is the number of samples predicted as the abnormal category, and b is a margin parameter that controls the classification proportion.

To make the feature representation carry cluster information, we reuse the encoder of the VAE as the clustering network; that is, the loss function of the encoder includes both the reconstruction loss and the clustering loss. Combining the above objectives, the joint clustering loss \(L_{norm}\) is the sum of the reconstruction loss, the KL-divergence alignment loss, and the constraint loss, with hyperparameters \(\gamma , \beta\) controlling the degree to which the embedded space is distorted:

$$\begin{aligned} L_{norm} = L_v + \gamma L_c + \beta L_m \end{aligned}$$
(4)

The robust feature representation learned by the VAE helps to improve the clustering performance, and the clustering results in turn guide the network to learn better representations. Joint clustering optimizes the VAE and the clustering synchronously through the shared encoder, which helps the network learn both the clustering and a strong representation constrained by the decoder.
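Putting Eqs. (1)-(4) together, a hedged sketch of the joint objective might look as follows, with \(\gamma = 1\), \(\beta = 0.5\), and \(b = 0.5\) as in Sect. 5.4. Reading \(m_{y=1}/m_y\) as the soft proportion of samples assigned to the abnormal cluster is our assumption, chosen to keep the constraint differentiable.

import torch

def constraint_loss(q, abnormal_cluster=1, b=0.5):
    # Eq. (3): penalize assigning more than a fraction b of samples to one category.
    # The soft assignment proportion stands in for m_{y=1}/m_y (assumption).
    frac_abnormal = q[:, abnormal_cluster].mean()
    return torch.clamp(frac_abnormal - b, min=0.0)

def joint_loss(l_v, l_c, l_m, gamma=1.0, beta=0.5):
    # Eq. (4): L_norm = L_v + gamma * L_c + beta * L_m.
    return l_v + gamma * l_c + beta * l_m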

4 Suspicious User Detection

After joint clustering for normal user identification, the remaining unstable users need further analysis. In this section, we aim to detect gas-theft suspects among unstable users. As the number of collected abnormal labels is limited, it is hard to train a model on them directly. Hence, we generate more training samples via a triplet-wise ranking method to improve the utilization of the labeled gas thefts. Considering that similar users are close to each other and dissimilar users are mutually exclusive in the representation space, we mine the behavior correlations between different users to judge whether a user is suspicious.

Fig. 3 Triplet ranking for suspicious user detection. The network consists of the triplet input, the anomaly score generator, the anomaly scores, and the deviation loss

Figure 3 shows the framework of the triplet ranking network. Firstly, we take the normal sample set as the normal group, randomly select a single normal sample from the normal sample set, randomly select a single abnormal sample from the abnormal sample set, and construct a triplet based on the representations of these samples. Then, we design an anomaly-scoring network with multiple fully connected layers, which takes a representation as input and generates an anomaly score. To capture closeness deviation in the representation space, we define the deviation among the triplet members based on the Z-score and use it to design the loss. After that, we feed the identified unstable users into the model to obtain their scores. Based on the scores of historical behavior, we set an anomaly-score threshold to determine whether a user is suspicious.

Existing methods realize end-to-end learning of anomaly scores through neural deviation learning, using a few labeled anomalies and a prior probability to enforce a statistically significant deviation between the anomaly scores of anomalies and those of upper-tail normal data objects [16]. For our problem, we formulate detection as a triplet relation learning task to generate more training samples and perform anomaly score learning. Here, we take the identified normal samples as the normal candidate set N and the labeled abnormal samples as the abnormal candidate set A, and construct triplet instance pairs for data augmentation. Specifically, we sample a normal example \(z^+\) and an abnormal example \(z^-\) from N and A, respectively. We use uniform sampling instead of importance sampling to generate more triplet combinations and improve data utilization. Let \(T = \left\{ \left\{ N, z^+, z^- \right\} | z^+\in N, z^- \in A\right\}\) be a meta triplet instance, which contains critical information for discriminating anomalies from normal users. The normal candidate set comes from the first module, which ensures the quality of the triplet sampling.
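A minimal sketch of the triplet construction with uniform sampling is given below; the number of triplets and the use of a fixed-size reference subset in place of the full set N are illustrative assumptions.

import random

def build_triplets(normal_set, abnormal_set, num_triplets=1000, group_size=32):
    # normal_set / abnormal_set: lists of representation vectors. Each meta triplet
    # pairs a reference group of normal users with one normal example z_plus and
    # one labeled gas theft z_minus, all sampled uniformly.
    triplets = []
    for _ in range(num_triplets):
        n_ref = random.sample(normal_set, min(group_size, len(normal_set)))
        z_plus = random.choice(normal_set)
        z_minus = random.choice(abnormal_set)
        triplets.append((n_ref, z_plus, z_minus))
    return triplets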

Based on the representation of each user, we design an anomaly-scoring learner \(\phi (\cdot )\) with multiple fully connected layers, which generates an anomaly score for each given input user. To achieve a statistically significant deviation between the anomaly scores of anomalies and those of normal users, we use the closeness deviation between a user and the normal user group as one of the measurement standards. Specifically, the deviation is defined as a Z-score:

$$\begin{aligned} d(z)=\frac{\phi (z)-\mu _R}{\sigma _R} \end{aligned}$$
(5)

where the reference score \(\mu _R\) is defined as the average of the anomaly scores of the normal candidate set N, and \(\sigma _R\) is the standard deviation of the anomaly scores of N; together they determine the closeness deviation of the triplet.

Then, we define a hinge loss \(L_{sus}\) on the triplet to optimize the score generator. It encourages a small deviation between normal samples and the normal group, a large deviation between abnormal samples and the normal group, and a large gap between the scores of abnormal and normal samples:

$$\begin{aligned} L_{sus}= max\left\{ 0,|d(z^+)|-d(z^-)-\phi (z^-)+\phi (z^+)+c\right\} \end{aligned}$$
(6)

where c is a margin. The loss pushes the anomaly scores of normal users \(\phi (z^+)\) as close as possible to the reference score \(\mu _R\) of the normal group, and pushes the anomaly scores of abnormal users \(\phi (z^-)\) as far as possible from \(\mu _R\). Note that if an anomaly \(z^-\) has negative \(\phi (z^-)\) and \(d(z^-)\), the loss is particularly large, which encourages large positive deviations for all anomalies. Therefore, the deviation function over the normal group, normal samples, and abnormal samples enables the network to learn easy-to-explain anomaly scores. During the inference phase, we feed the representations of the unstable users into the network; users with higher scores are more suspicious.
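The scoring network and the loss in Eqs. (5) and (6) can be sketched as follows, using the 8- and 4-unit hidden layers from Sect. 5.5 together with \(\sigma _R = 1\) and c = 3; again, this is an illustrative reconstruction rather than the authors' implementation.

import torch
import torch.nn as nn

class AnomalyScorer(nn.Module):
    # phi(.): fully connected scoring network with 8 and 4 hidden units (Sect. 5.5).
    def __init__(self, z_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 8), nn.ReLU(),
                                 nn.Linear(8, 4), nn.ReLU(),
                                 nn.Linear(4, 1))

    def forward(self, z):
        return self.net(z).squeeze(-1)

def triplet_deviation_loss(scorer, n_ref, z_plus, z_minus, sigma_r=1.0, c=3.0):
    # Eq. (5): Z-score deviations against the normal reference group.
    mu_r = scorer(n_ref).mean()
    d_plus = (scorer(z_plus) - mu_r) / sigma_r
    d_minus = (scorer(z_minus) - mu_r) / sigma_r
    # Eq. (6): hinge loss over the triplet.
    loss = torch.clamp(d_plus.abs() - d_minus
                       - scorer(z_minus) + scorer(z_plus) + c, min=0.0)
    return loss.mean()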

We use the closeness deviation between triplet members to design the deviation loss, deeply mining the behavior correlations between different users, and we use triplet-wise training to generate more training samples and improve data utilization. Besides, we use the representations extracted during normal user identification as the input format, take the identified normal samples as the training candidate set, and take the identified unstable samples as the detection objects. Thus, normal user identification and suspicious user detection are seamlessly integrated, which overcomes the problem of label scarcity.

4.1 Algorithm Pseudo-Code

Algorithm 1 outlines the proposed approach. For normal user identification based on joint clustering, we first initialize the VAE and the clustering centers (Line 1), and then train the joint clustering network (Lines 2-4). After that, we detect normal and unstable users and obtain the corresponding representations (Lines 5-6). For suspicious user detection based on triplet ranking, we construct the triplets (Line 8) and train the triplet ranking model to obtain the corresponding anomaly scores (Lines 9-10). Finally, we predict suspects among the identified unstable users (Line 11).

Algorithm 1 The proposed neural clustering and ranking approach (pseudo-code figure)
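Since the pseudo-code figure is not reproduced here, the outlined steps can be summarized in the following Python-style sketch. The helper names tie back to the sketches above where possible; the remaining ones (pretrain_vae, encode, kmeans_init, train_joint_clustering, split_users, train_triplet_ranking) are hypothetical stand-ins for the corresponding training procedures, not the authors' code.

def ncra(X, theft_labels, threshold):
    # Normal user identification (joint clustering).
    vae = pretrain_vae(X)                                   # Line 1: pre-train VAE
    centers = kmeans_init(encode(vae, X), k=2)              # Line 1: init cluster centers
    vae, centers = train_joint_clustering(vae, centers, X)  # Lines 2-4: minimize L_norm
    normal, unstable = split_users(vae, centers, X)         # Line 5: divide user groups
    Z = encode(vae, X)                                      # Line 6: final representations
    # Suspicious user detection (triplet ranking).
    triplets = build_triplets(Z[normal], Z[theft_labels])   # Line 8: construct triplets
    scorer = train_triplet_ranking(triplets)                # Lines 9-10: minimize L_sus
    scores = scorer(Z[unstable])                            # score the unstable users
    return [u for u, s in zip(unstable, scores) if s > threshold]  # Line 11: suspects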

5 Experiments

5.1 Datasets

We conduct experiments on three real-world datasets [27] collected by three branches of a gas group, denoted companies A, B, and C for short. In total, there are 3,035 users, of which only 11 are labeled as gas thefts. Specifically, companies A, B, and C have 584, 781, and 1,670 users with 4, 2, and 5 labeled thefts, respectively. Each boiler room user has a daily gas consumption record, and the time span is from November 15, 2018 to March 15, 2019.

5.2 Parameter Setting

5.3 Data Preparation

For occasional missing data, we use forward filling. For each dataset, we normalize the data to the range [-1, 1].
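As a concrete illustration of this preparation step, assuming each column of a pandas DataFrame holds one user's daily series, the following sketch applies forward filling and per-user min-max scaling to [-1, 1]:

import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Fill occasional gaps with the most recent observation, then scale each
    # user's series (one column per user) to the range [-1, 1].
    df = df.ffill()
    lo, hi = df.min(), df.max()
    return 2 * (df - lo) / (hi - lo) - 1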

5.4 Normal User Identification

The cluster number of K-means is set to 2. The length of the gas sequence is 121, and the dimension of the latent space is 16. The size of the symmetric hidden layers is 64, with ReLU as the activation function. The loss weights \(\gamma\) and \(\beta\) and the margin b are set to 1, 0.5, and 0.5, respectively. \(h(z)= Normal(0,1)\) is the standard normal distribution. For pre-training, the feature extractor is trained for 100 epochs with Adam, and is then trained for a further 70 epochs while updating the joint clustering. The batch size and learning rate for each subset are set by grid search. For datasets A and B, the batch size is 15 and the learning rate is 0.0001. For dataset C, the batch size is 25 and the learning rate is 0.0005.

5.5 Suspicious User Detection

The ranking network consists of two hidden layers with 8 and 4 hidden units, respectively, to learn more intricate data interactions. The network is trained for 100 epochs with Adam, and each unit uses ReLU as the activation function. The margin c is set to 3 and \(\sigma _R\) is set to 1.

5.6 Evaluation Methods

We adopt cross-validation on the three subsets for evaluation, where two serve as the training set and the remaining one is used for evaluation. We use precision (P) and recall (R) as metrics. Due to the scarcity of labels, the detected anomalies should cover as many labels as possible while avoiding false alarms. Therefore, we aim for precision that is as high as possible while recall stays close to 1.
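For reference, precision and recall of the thresholded anomaly scores can be computed as in the short sketch below (using scikit-learn; the threshold value itself is dataset-specific):

from sklearn.metrics import precision_score, recall_score

def evaluate(y_true, scores, threshold):
    # y_true: 1 for labeled gas thefts, 0 otherwise; scores: anomaly scores.
    y_pred = [int(s > threshold) for s in scores]
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)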

5.7 Baseline Methods

  • Deep SVDD [20]: an unsupervised anomaly detection method based on one-class classification inspired by kernel methods, which minimizes the hypersphere volume in the AE-based sample feature space. Eps is set to \(10^{-6}\).

  • DBSCAN [5]: a density-based clustering algorithm, which defines the cluster as the largest set of points connected by density. Here, MinPts is set to 4 and eps is set to 0.75.

  • DAGMM [31]: an unsupervised anomaly detection method combining an autoencoder and a Gaussian mixture model. Here, the number of training epochs is set to 200 and the size of mini-batches is 256.

  • SRCNN [18]: a time series anomaly detection method combining Spectral Residual (SR) and CNN. It adopts the spectral residual technique from computer vision to strengthen anomalies. Parameters are set as [18] suggests.

  • SVOC [27]: which uses the temperature deformation method to find normal users, and trains OCSVM with them. Unstable users are ranked by the probability the trained OCSVM predicts. Parameters are set as [27] suggests.

  • msRank [26]: which uses gas consumption mode clustering to find normal users and discovers gas-theft suspects among unstable users by RankNet-based suspicion scoring. Parameters are set as [26] suggests.

  • Deep SAD [19]: an extension of Deep SVDD built on the assumption that the entropy of the latent distribution of normal data is lower than that of the anomalous distribution. Here, \(\eta\) is set to 1 and eps is set to \(10^{-6}\).

  • SSD\(_k\) [21]: which takes the Mahalanobis distance to the nearest cluster center as the measure of anomaly degree. Here, the number of training epochs is set to 50 and the size of mini-batches is 15.

5.8 Performance Comparison

Table 1 Performance comparison with baselines

5.9 Comparison With Baselines

As Table 1 illustrates, we compare the Neural Clustering and Ranking Approach (NCRA) with various baselines. Though Deep SVDD and DAGMM can be used in unsupervised settings, neither shows good performance when no normal labels are available. DBSCAN does not perform well on high-dimensional time series, which relates to abnormal users gathering in small clusters. SRCNN, Deep SAD, and SSD\(_k\) leverage the gas-theft labels and all unlabeled data, which makes their hit rates higher. However, under the label scarcity of realistic conditions, the unlabeled data mix normal users with unlabeled abnormal users, so using unlabeled data indiscriminately degrades the performance of these detection methods. SVOC and msRank manually extract simple statistical features and learn with shallow models, which may fail to capture complex gas usage patterns. Different from them, NCRA achieves the best recall with higher precision, since it deeply mines the regular patterns of gas consumption behavior and tightly connects the normal user identification and suspicious user detection modules to address label scarcity.

5.10 Comparison With Joint Clustering Variants

As shown in Table 2, we compare joint clustering (JC) with variants that use different representation learning models and different loss combinations. Here, we replace the VAE with other representation learning models, such as an autoencoder (AE) and a Gated Recurrent Unit (GRU), while keeping the other parts of pre-training and joint clustering unchanged. JC w/o cons indicates that the constraint loss is not considered, i.e., \(L_m\) is set to 0 in Eq. 4. The results show that the anomalies detected by JC cover the labels as fully as possible while avoiding false positives, achieving high accuracy. Unlike GRU and AE, the VAE is more robust to noise and better learns the representation of gas usage behavior. As for the constraint loss, it avoids mapping all samples into the same user cluster, eliminating the degenerate solution. Note that this is only the first module of the whole method, so it is necessary to keep recall high; precision can be improved in the second module.

Table 2 Comparison with different joint clustering

5.11 Comparison With Triplet Ranking Variants

As for the two-step methods presented in Table 3, we compare triplet ranking (TR) with typical classification and ranking substitutes. OCSVM does not perform well since it only models normal users; moreover, its utilization of the labeled gas thefts is low, as they are only used to set a statistical threshold. RankNet only considers the difference between a single normal sample and an abnormal sample. Compared with these ranking models, TR achieves the best performance in reducing the false-positive rate. This is because TR mines the behavior correlations to obtain easy-to-explain anomaly scores, and uses triplet-wise training to generate more training samples and improve data utilization.

Table 3 Comparison with substitutes for triplet ranking

5.12 Visualization of Triplet Ranking Variants

We visualize the triplet ranking variants by using t-SNE to reduce the dimension of the latent representation Z from 16 to 2, and plot the partition of unstable users in Fig. 4. Here, different colors represent different user groups: blue dots represent irregular users, and purple dots represent abnormal users. We find that irregular users account for the majority and gather in the middle, while abnormal users account for a small part and are scattered around the periphery. The two user groups of JC & RankNet are not clearly separated. The abnormal user group of JC & OCSVM is clearly distributed around the periphery, but the number of abnormal users is too high. JC & TR performs best: the detected abnormal users are few and distributed around the periphery, reflecting the power of triplet ranking.

Fig. 4 Gas-theft suspects and irregular users. The unstable user partitions of JC & RankNet, JC & OCSVM, and JC & TR are visualized in 2D. Blue dots indicate irregular users, and purple dots indicate abnormal users

5.13 Detected Suspects

Taking the dataset of company A as an example, we rank the unstable users by their anomaly scores to obtain the gas-theft suspects. Figure 5 shows the original gas consumption curves of the three detected suspects. The red circles indicate excessive fluctuations in gas consumption, i.e., consumption that frequently and irregularly increases or decreases sharply or even restarts. Our approach mainly detects frequent, large fluctuations, while tolerating boiler rooms that occasionally need to be shut down or adjusted within a small range under special circumstances.

Fig. 5 Abnormal examples with the highest anomaly scores. The original gas consumption curves of the three detected suspects; the horizontal axis is the date and the vertical axis is the gas consumption. Red circles mark excessive fluctuations, i.e., consumption that frequently and irregularly increases or decreases sharply or even restarts

5.14 Parameter Analysis

5.15 Cluster Number of Joint Clustering

Setting the number of JC clusters equal to the number of classes serves as prior knowledge. To demonstrate the representation ability of JC as an unsupervised clustering model, we vary the number of clusters over \(\left\{ 2, 3, 4, 5, 10, 20\right\}\). As shown in Fig. 6, precision is highest when the cluster number is 2. The purpose of JC is to divide users into normal and unstable users. When the number of clusters is greater than 2, normal or unstable users may be split into multiple clusters due to the diversity of user behavior. In this case, we cannot tell which cluster represents the normal users, resulting in a worse clustering effect.

Fig. 6 Precision of joint clustering under different cluster numbers. Joint clustering precision of the three datasets for 2, 3, 4, 5, 10, and 20 clusters. Blue is dataset A, yellow is dataset B, and green is dataset C

5.16 Representation Dimension of Joint Clustering

As the input to clustering, we vary the dimension of the hidden layer Z over \(\left\{ 4, 8, 16, 32, 64\right\}\). As shown in Fig. 7a, a dimension of 16 yields both high precision and high recall. When the dimension is 4, the higher precision comes at the cost of lower recall, which deviates from our intention. When the representation dimension is too high, clustering becomes harder due to sparse data distribution and many irrelevant attributes in high-dimensional data. Conversely, if the representation dimension is too low, the data lose some important attributes. Hence, choosing an appropriate representation dimension helps explain the dataset well.

Fig. 7 Precision and recall of joint clustering under different representation dimensions. Joint clustering precision and recall of the three datasets for hidden-layer dimensions 4, 8, 16, 32, and 64. Blue is dataset A, yellow is dataset B, and green is dataset C

5.17 Sampling Method of Triplet Ranking

As shown in Table 4, we compare different choices of the reference score \(\mu _R\) and the deviation \(\sigma _R\) in Eq. 5 of triplet ranking (TR). There are two main ways to generate \(\mu _R\) and \(\sigma _R\): data-driven methods compute them from the feature data, while prior-driven methods rely on a Gaussian prior. For the deviation \(\sigma _R\), we can either set \(\sigma _R=1\) following the prior-driven method of [16], or use the standard deviation (SD) of the anomaly scores of the normal candidate set N in the data-driven manner. For the reference score \(\mu _R\), we can select the average, median, maximum, or minimum of the anomaly scores of N. When \(\mu _R\) is the average anomaly score and \(\sigma _R\) is 1, the result is best. Compared with the maximum, median, and minimum, the average anomaly score better represents the normal sample group.
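The compared choices can be summarized in a small helper, sketched below under the same naming as Eq. (5); the dictionary of statistics is merely a compact way to express the four candidates for \(\mu _R\).

import torch

def reference_stats(scores_normal, mu_mode='mean', prior_sigma=True):
    # Data-driven vs prior-driven choices for mu_R and sigma_R (Table 4).
    mu = {'mean': scores_normal.mean(), 'median': scores_normal.median(),
          'max': scores_normal.max(), 'min': scores_normal.min()}[mu_mode]
    sigma = torch.tensor(1.0) if prior_sigma else scores_normal.std()
    return mu, sigma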

Table 4 Comparison with different deviation loss

5.18 Number of Labeled Anomalies

To further explore the effect of the number of labeled anomalies, the precision of JC and TR with different numbers of labeled gas thefts is shown in Fig. 8a and b. We selected abnormal users detected with high confidence and doubled the number of labeled abnormal users for data augmentation, but the experimental results did not change. This is because there are many ways to steal gas in practice, with ever-changing effects at the data level, so data augmentation cannot cover these variations. In addition, our NCRA learns the underlying rules by characterizing the differences between labeled gas thefts and regular gas usage, rather than merely memorizing the labels.

Fig. 8 Precision under different numbers of labeled anomalies. Precision of joint clustering and triplet ranking for different numbers of labeled anomalies. Blue represents the original number of labeled data, and yellow represents twice that number

6 Related Work

6.1 Gas Theft Detection

Natural gas theft detection mainly relies on on-site inspection by employees of the natural gas company, which consumes manpower and material resources [24]. Smart meters regularly report massive gas consumption records, which creates an opportunity for data-driven detection methods. The two previous gas-theft detection studies first divided users into normal and unstable users by business characteristics, and then detected suspects among the unstable users with off-the-shelf classifiers [26, 27]. These data-driven methods set statistical indicators and manually extract simple statistical features, which may fail to capture complex gas usage patterns. Different from them, we use deep learning to detect gas theft, which is more capable of discovering potential suspects.

6.2 PU Learning

Learning from positive and unlabeled data, or PU learning, is the task where a learner only has access to positive examples and unlabeled data [10]. The assumption is that the unlabeled data contain both positive and negative examples [4]. This task has attracted increasing interest in the machine learning literature, as this type of data naturally arises in many applications [1]. Most methods fall into three categories: two-step methods [14], biased learning [9], and class prior incorporation [23]. Due to the scarcity of labeled data, we adopt a two-step method that tightly combines our two modules. Common two-step methods that directly use off-the-shelf classifiers suffer from low data utilization; instead, we generate more training samples via triplet-wise training to improve data utilization.

6.3 Urban Anomaly Detection

Urban anomalies are unusual events occurring in urban environments that may endanger public safety [30]. Recently, data-driven urban anomaly analysis frameworks have emerged that use urban big data and machine learning to detect urban anomalies automatically [2]. Existing works on urban anomaly detection fall into three groups [29]: spatiotemporal-feature-based, urban-dynamic-pattern-based, and video anomaly detection methods. Treating our task as an individual anomaly detection task, we follow the feature-based line and focus on better feature extraction. In the gas scenario, we use joint clustering to learn the behavior representation of the original sequence and synchronously separate normal users, and then detect suspects based on triplet ranking.

7 Conclusion

In this paper, we propose a neural clustering and ranking approach to detect gas-theft suspects among boiler room users. Considering the consistent behavior rules of most normal users, we first distinguish normal users from unstable users and obtain their representations by joint clustering. Based on these results, the triplet ranking network detects gas-theft suspects among unstable users by ranking anomaly scores. Experimental results on three real-world datasets demonstrate the superiority of our approach over various baselines in reducing the false-positive rate.

In addition, NCRA demonstrates an ability to detect anomalies in time series: it can identify normal users and then single out suspicious users without using supervision information during training. In the future, we will focus on generalizing our approach to more types of gas users and to other urban anomaly detection tasks. We also plan to design a real-time system: the gas consumption records collected by smart gas meters would be reported to Hive, the data would be migrated to a MySQL database using Sqoop every week, and NCRA would then produce a weekly list of suspects based on the data of the past 30 days.