
1 Introduction

Imagine that you manage a social review site (e.g., Yelp) and have the records of which accounts wrote reviews for which restaurants. How do you detect suspicious lockstep behavior: for example, a set of accounts which give fake reviews to the same set of restaurants? What about the case where additional information is present, such as the timestamp of each review, or the keywords in each review?

Such problems of detecting suspicious lockstep behavior have been extensively studied from the perspective of dense subgraph detection. Intuitively, in the above example, highly synchronized behavior induces dense subgraphs in the bipartite review graph of accounts and restaurants. Indeed, methods which detect dense subgraphs have been successfully used to spot fraud in settings including social networks [5, 10, 13, 14], auctions [20], and search engines [8].

Additional information helps identify suspicious lockstep behavior. In the above example, the fact that reviews forming a dense subgraph were also written at about the same time, with the same keywords and number of stars, makes the reviews even more suspicious. A natural and effective way to incorporate such extra information is to model data as a tensor and find dense blocks in it [12, 19].

Fig. 1. M-Zoom is fast, accurate, and effective. Fast: (a) M-Zoom was 55\(\times \) faster with denser blocks than CrossSpot in Korean Wikipedia Dataset. Accurate: (a) M-Zoom found 24\(\times \) denser blocks than CPD. (b) M-Zoom identified network attacks with near-perfect accuracy (AUC = 0.98). Effective: (c) M-Zoom spotted edit wars, during which many users (distinguished by colors) edited the same set of pages hundreds of times within several hours. (d) M-Zoom spotted bots, and pages edited hundreds of thousands of times by the bots. (Color figure online)

However, neither existing methods for detecting dense blocks in tensors nor simple extensions of graph-based methods are satisfactory in terms of speed, accuracy, or flexibility. In particular, the types of fraud detectable by each method are limited since, explicitly or implicitly, each method is based on only one density metric, which decides how dense and thus suspicious each block is.

Hence, in this work, we propose M-Zoom (Multidimensional Zoom), a general and flexible framework for detecting dense blocks in tensors. M-Zoom allows for a broad class of density metrics, in addition to having the following strengths:

  • Scalable: M-Zoom is up to 114 \(\times \) faster than state-of-the-art methods with similar accuracy (Fig. 2) thanks to its linear scalability with all aspects of tensors (Fig. 4).

  • Provably accurate: M-Zoom provides a guarantee on the lowest density of blocks it finds (Theorem 4), and shows accuracy similar to that of state-of-the-art methods in real-world datasets (Fig. 1a).

  • Flexible: M-Zoom works successfully with high-order tensors and supports various density measures, multi-block detection, and size bounds (Table 1).

  • Effective: M-Zoom successfully detected edit wars and bot activities in Wikipedia (Figs. 1c and d), and also detected network attacks with near-perfect accuracy (AUC = 0.98) based on TCP dump data (Fig. 1b).

Reproducibility: Our open-sourced code and the data we used are available at http://www.cs.cmu.edu/~kijungs/codes/mzoom.

Section 2 presents preliminaries and problem definitions. Our proposed M-Zoom is described in Sect. 3 followed by experimental results in Sect. 4. After discussing related work in Sect. 5, we draw conclusions in Sect. 6.

Table 1. M-Zoom is flexible. Comparison between M-Zoom and other methods for dense-block detection. represents ‘supported’.

2 Preliminaries and Problem Definition

In this section, we introduce definitions and notations used in the paper. We also discuss density measures and give a formal definition of our problems.

2.1 Definitions and Notations

Let \(\varvec{\mathcal {R}}(A_{1},A_{2},...,A_{N},X)\) be a relation with N dimension attributes \(A_{1}\), \(A_{2}\), ..., \(A_{N}\), and a nonnegative measure attribute X (see the supplementary document [1] for a running example and its pictorial description). We use \(\varvec{\mathcal {R}}_{n}\) to denote the set of distinct values of \(A_{n}\) in \(\varvec{\mathcal {R}}\), and use \(a_{n}\in \varvec{\mathcal {R}}_{n}\) for a value of \(A_{n}\). The value of \(A_{n}\) in tuple t is denoted by \(t[A_{n}]\), and the value of X is denoted by t[X]. The relation \(\varvec{\mathcal {R}}\) can be represented as an N-way tensor. In the tensor, each n-th mode has length \(|\varvec{\mathcal {R}}_{n}|\), and each cell has the value of attribute X, if the corresponding tuple exists, and 0 otherwise. Let \(\varvec{\mathcal {B}}_{n}\) be a subset of \(\varvec{\mathcal {R}}_{n}\). Then, we define a block \(\varvec{\mathcal {B}}(A_{1},A_{2},...,A_{N}, X)=\{t\in \varvec{\mathcal {R}}: 1 \le \forall n \le N, t[A_{n}]\in \varvec{\mathcal {B}}_{n}\}\), the set of tuples where each dimension attribute \(A_{n}\) has a value in \(\varvec{\mathcal {B}}_{n}\). \(\varvec{\mathcal {B}}\) is called ‘block’ because it forms a subtensor where each n-th mode has length \(|\varvec{\mathcal {B}}_{n}|\) in the tensor representation of \(\varvec{\mathcal {R}}\). The set of tuples of \(\varvec{\mathcal {R}}\) with attribute \(A_{n}=a_{n}\) is denoted by \(\varvec{\mathcal {R}}(a_{n})=\{t\in \varvec{\mathcal {R}}: t[A_{n}] = a_{n}\}\). We define the mass of \(\varvec{\mathcal {R}}\) as \(M_{\varvec{\mathcal {R}}}=Mass(\varvec{\mathcal {R}})=\sum _{t\in \varvec{\mathcal {R}}}t[X]\), the sum of the values of attribute X in \(\varvec{\mathcal {R}}\). 
We also define the size of \(\varvec{\mathcal {R}}\) as \(S_{\varvec{\mathcal {R}}}=Size(\varvec{\mathcal {R}})=\sum _{n=1}^{N}|\varvec{\mathcal {R}}_{n}|\) and the volume of \(\varvec{\mathcal {R}}\) as \(V_{\varvec{\mathcal {R}}}=Volume(\varvec{\mathcal {R}})=\prod _{n=1}^{N}|\varvec{\mathcal {R}}_{n}|\). Lastly, we use \([x]=\{1,2,...,x\}\) for convenience. Table 2 lists frequently used symbols.
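The notation above can be made concrete with a small sketch. It assumes, purely for illustration, that a relation is stored as a list of Python tuples whose last entry is the measure attribute X; the function names and the toy review data are hypothetical and not part of the original work.

```python
from math import prod

def attribute_values(relation, n_dims):
    """Return [R_1, ..., R_N]: the distinct values of each dimension attribute."""
    return [{t[n] for t in relation} for n in range(n_dims)]

def mass(relation):
    """M_R: sum of the measure attribute X (last entry of each tuple)."""
    return sum(t[-1] for t in relation)

def size(relation, n_dims):
    """S_R: sum over modes of the number of distinct attribute values."""
    return sum(len(v) for v in attribute_values(relation, n_dims))

def volume(relation, n_dims):
    """V_R: product over modes of the number of distinct attribute values."""
    return prod(len(v) for v in attribute_values(relation, n_dims))

def block(relation, value_sets):
    """Tuples whose n-th dimension attribute lies in value_sets[n] for every n."""
    return [t for t in relation
            if all(t[n] in value_sets[n] for n in range(len(value_sets)))]

# Toy relation R(user, restaurant, #reviews); the data is made up.
R = [
    ("alice", "cafe", 3),
    ("alice", "diner", 1),
    ("bob", "cafe", 2),
    ("carol", "bistro", 5),
]
B = block(R, [{"alice", "bob"}, {"cafe"}])
```

Here `R` has mass 11, size 3 + 3 = 6, and volume 3 × 3 = 9, while the block `B` contains two tuples with total mass 5.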

Table 2. Table of symbols.

2.2 Density Measures

In this paper, we consider three specific density measures although our method is not restricted to them. Two of the density measures (Definitions 1 and 2) are natural multi-dimensional extensions of classic density measures which have been widely used for subgraphs. The merits of the original measures are discussed in [7, 15], and extensive research based on them is discussed in Sect. 5.

Definition 1

(Arithmetic Average Mass [7]). The arithmetic average mass of a block \({\varvec{\mathcal {B}}}\) of a relation \({\varvec{\mathcal {R}}}\) is defined as \(\rho _{ari}(\varvec{\mathcal {B}},\varvec{\mathcal {R}})=M_{\varvec{\mathcal {B}}}/(S_{\varvec{\mathcal {B}}}/N)\).

Definition 2

(Geometric Average Mass [7]). The geometric average mass of a block \(\varvec{\mathcal {B}}\) of a relation \(\varvec{\mathcal {R}}\) is defined as \(\rho _{geo}(\varvec{\mathcal {B}},\varvec{\mathcal {R}})=M_{\varvec{\mathcal {B}}}/V_{\varvec{\mathcal {B}}}^{(1/N)}\).

The other density measure (Definition 3) is the negative log likelihood of \(M_{\varvec{\mathcal {B}}}\) on the assumption that the value on each cell (in the tensor representation) of \(\varvec{\mathcal {R}}\) follows a Poisson distribution. This proved useful in fraud detection [12].

Definition 3

(Suspiciousness [12]). The suspiciousness of a block \(\varvec{\mathcal {B}}\) of a relation \(\varvec{\mathcal {R}}\) is defined as \(\rho _{susp}(\varvec{\mathcal {B}},\varvec{\mathcal {R}})=M_{\varvec{\mathcal {B}}}(\log (M_{\varvec{\mathcal {B}}}/M_{\varvec{\mathcal {R}}})-1)+M_{\varvec{\mathcal {R}}}V_{\varvec{\mathcal {B}}}/V_{\varvec{\mathcal {R}}}-M_{\varvec{\mathcal {B}}}\log (V_{\varvec{\mathcal {B}}}/V_{\varvec{\mathcal {R}}})\).
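As a rough illustration, Definitions 1 to 3 can be written as functions of the block mass and the per-mode cardinalities. The function names, the argument layout, and the conventions chosen for empty blocks are assumptions made for this sketch only.

```python
import math

def rho_ari(m_b, sizes_b):
    """Arithmetic average mass: M_B / (S_B / N)."""
    total = sum(sizes_b)
    # Assumed convention: an empty block has density 0.
    return 0.0 if total == 0 else m_b / (total / len(sizes_b))

def rho_geo(m_b, sizes_b):
    """Geometric average mass: M_B / V_B^(1/N)."""
    vol = math.prod(sizes_b)
    # Assumed convention: a block with an empty mode has density 0.
    return 0.0 if vol == 0 else m_b / vol ** (1.0 / len(sizes_b))

def rho_susp(m_b, sizes_b, m_r, sizes_r):
    """Suspiciousness of a block with positive mass and positive volume."""
    v_ratio = math.prod(sizes_b) / math.prod(sizes_r)
    return (m_b * (math.log(m_b / m_r) - 1)
            + m_r * v_ratio
            - m_b * math.log(v_ratio))

# A block of mass 5 with cardinalities (2, 1) inside a relation
# of mass 11 with cardinalities (3, 3).
ari = rho_ari(5, [2, 1])
geo = rho_geo(5, [2, 1])
susp = rho_susp(5, [2, 1], 11, [3, 3])
```

Note that \(\rho _{ari}\) and \(\rho _{geo}\) depend only on the block, whereas \(\rho _{susp}\) also needs the mass and volume of the whole relation.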

Our method, however, is not restricted to the three measures mentioned above. Our method, which searches for dense blocks in a tensor, allows for any density measure \(\rho \) that satisfies Axiom 1.

Axiom 1

(Density Axiom). If two blocks of a relation have the same cardinality for every dimension attribute, the block with higher or equal mass is at least as dense as the other. Formally,

$$M_{\varvec{\mathcal {B}}} \ge M_{\varvec{\mathcal {B}}'}\, { and }\, |\varvec{\mathcal {B}}_{n}|=|\varvec{\mathcal {B}}'_{n}|, \forall n\in [N] \Rightarrow \rho (\varvec{\mathcal {B}},\varvec{\mathcal {R}}) \ge \rho (\varvec{\mathcal {B}}',\varvec{\mathcal {R}}).$$

2.3 Problem Definition

We formally define the problem of detecting the k densest blocks in a tensor.

Problem 1 (k-Densest Blocks)

(1) Given: a relation \(\varvec{\mathcal {R}}\), the number of blocks k, and a density measure \(\rho \), (2) Find: k distinct blocks of \(\varvec{\mathcal {R}}\) with the highest densities in terms of \(\rho \).

We also consider a variant of Problem 1 which incorporates lower and upper bounds on the size of the detected blocks. This is particularly useful if the unrestricted densest block is not meaningful due to being too small (e.g. a single tuple) or too large (e.g. the entire tensor).

Problem 2 (k-Densest Blocks with Size Bounds)

(1) Given: a relation \(\varvec{\mathcal {R}}\), the number of blocks k, a density measure \(\rho \), lower size bound \(S_{min}\), and upper size bound \(S_{max}\), (2) Find: k distinct blocks of \(\varvec{\mathcal {R}}\) with the highest densities in terms of \(\rho \) (3) Among: blocks whose sizes are at least \(S_{min}\) and at most \(S_{max}\).

Even when we restrict our attention to a special case (N=2, k=1, \(\rho \)=\(\rho _{ari}\), \(S_{min}\)=\(S_{max}\)), exactly solving Problem 1 takes \(O(S_{\varvec{\mathcal {R}}}^6)\) time [9] and exactly solving Problem 2 is NP-hard [3], so both are infeasible for large datasets. Thus, we focus on an approximation algorithm which (1) has linear scalability with all aspects of \(\varvec{\mathcal {R}}\), (2) provides accuracy guarantees at least for some density measures, and (3) produces meaningful results in real-world datasets, as explained in detail in Sects. 3 and 4.

3 Proposed Method

In this section, we propose M-Zoom (Multidimensional Zoom), a scalable, accurate, and flexible method for finding dense blocks in a tensor. We present the details of M-Zoom in Sect. 3.1 and discuss its efficient implementation in Sect. 3.2. After analyzing the time and space complexity in Sect. 3.3, we prove the quality guarantees provided by M-Zoom in Sect. 3.4.

3.1 Algorithm

Algorithm 1 describes the outline of M-Zoom. M-Zoom first copies the given relation \(\varvec{\mathcal {R}}\) and assigns it to \(\varvec{\mathcal {R}}^{ori}\) (line 1). Then, M-Zoom finds k dense blocks one by one from \(\varvec{\mathcal {R}}\) (line 4). After finding each block from \(\varvec{\mathcal {R}}\), M-Zoom removes the tuples in the block from \(\varvec{\mathcal {R}}\) to prevent the same block from being found again (line 5). Due to these changes in \(\varvec{\mathcal {R}}\), a block found in \(\varvec{\mathcal {R}}\) is not necessarily a block of the original relation \(\varvec{\mathcal {R}}^{ori}\). Thus, instead of returning the blocks found in \(\varvec{\mathcal {R}}\), M-Zoom returns the blocks of \(\varvec{\mathcal {R}}^{ori}\) consisting of the same attribute values as the found blocks (lines 6–7). This also enables M-Zoom to find overlapping blocks, i.e., a tuple can be included in two or more blocks.
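The outer loop described above can be sketched as follows. This is an illustrative reading of the prose, not the authors' Java implementation: the single-block search is passed in as a function, and `heaviest_tuple_block` is a deliberately trivial, hypothetical stand-in used only to make the example runnable.

```python
def in_block(t, value_sets):
    """True if every dimension attribute of tuple t lies in the block's value sets."""
    return all(t[n] in value_sets[n] for n in range(len(value_sets)))

def m_zoom(relation, k, find_single_block):
    """Find up to k blocks; find_single_block maps a relation to [B_1, ..., B_N]."""
    original = list(relation)    # R^ori, kept untouched
    remaining = list(relation)   # R, shrinks after each block is found
    results = []
    for _ in range(k):
        if not remaining:
            break
        value_sets = find_single_block(remaining)
        # Remove the found tuples so the same block is not found again.
        remaining = [t for t in remaining if not in_block(t, value_sets)]
        # Report the block of the ORIGINAL relation with the same attribute
        # values, which is why reported blocks may overlap.
        results.append([t for t in original if in_block(t, value_sets)])
    return results

def heaviest_tuple_block(relation):
    """Hypothetical placeholder search: the block made of the heaviest tuple."""
    t = max(relation, key=lambda u: u[-1])
    return [{t[n]} for n in range(len(t) - 1)]

R = [
    ("alice", "cafe", 3),
    ("alice", "diner", 1),
    ("bob", "cafe", 2),
    ("carol", "bistro", 5),
]
blocks = m_zoom(R, 2, heaviest_tuple_block)
```

With the placeholder search, the first block is the `carol`/`bistro` tuple and the second is the `alice`/`cafe` tuple; in practice `find_single_block` would be the greedy search of Algorithm 2.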


Algorithm 2 describes how M-Zoom finds a single dense block from the given relation \(\varvec{\mathcal {R}}\). The block \(\varvec{\mathcal {B}}\) is initialized to \(\varvec{\mathcal {R}}\) (lines 1–2). From \(\varvec{\mathcal {B}}\), M-Zoom removes attribute values one by one in a greedy way until no attribute value is left (line 4). Specifically, M-Zoom finds the attribute value \(a_{i}\) that maximizes \(\rho (\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a_{i}), \varvec{\mathcal {R}})\), which corresponds to the density when tuples with \(A_{i}=a_{i}\) are removed from \(\varvec{\mathcal {B}}\) (line 7). Then, the attribute value, denoted by \(a^{*}_{i}\), and the tuples with \(A_{i}=a^{*}_{i}\) are removed from \(\varvec{\mathcal {B}}_{i}\) and \(\varvec{\mathcal {B}}\), respectively (lines 8–9). Before removing each attribute value, M-Zoom adds the current \(\varvec{\mathcal {B}}\) to the snapshot list if \(\varvec{\mathcal {B}}\) satisfies the size bound (i.e., \(S_{min}\le S_{\varvec{\mathcal {B}}} \le S_{max}\)) (lines 5–6). As the final step of finding a block, M-Zoom returns the block with the maximum density among those in the snapshot list (line 10).
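A direct, unoptimized rendering of this greedy search is sketched below, assuming tuples of the form `(a_1, ..., a_N, x)` and \(\rho _{ari}\) as the density measure. It scans every attribute value in each iteration and keeps a copy of the best snapshot, so it does not have the running time of the implementation discussed in Sect. 3.2; all names are illustrative.

```python
def rho_ari(m_b, sizes_b):
    """Arithmetic average mass; an empty block is given density 0 (assumption)."""
    total = sum(sizes_b)
    return 0.0 if total == 0 else m_b * len(sizes_b) / total

def find_single_block(relation, density=rho_ari, s_min=0, s_max=float("inf")):
    n_dims = len(relation[0]) - 1
    tuples = list(relation)                                   # B starts as R
    value_sets = [{t[n] for t in tuples} for n in range(n_dims)]
    best_density, best_sets = float("-inf"), None
    while any(value_sets):                                    # some value is left
        m_b = sum(t[-1] for t in tuples)
        sizes = [len(v) for v in value_sets]
        # Snapshot the current block if it satisfies the size bounds.
        if s_min <= sum(sizes) <= s_max:
            d = density(m_b, sizes)
            if d > best_density:
                best_density, best_sets = d, [set(v) for v in value_sets]
        # Choose the attribute value whose removal maximizes the density.
        choice = None
        for n in range(n_dims):
            for a in value_sets[n]:
                removed = sum(t[-1] for t in tuples if t[n] == a)
                new_sizes = sizes[:n] + [sizes[n] - 1] + sizes[n + 1:]
                d = density(m_b - removed, new_sizes)
                if choice is None or d > choice[0]:
                    choice = (d, n, a)
        _, n, a = choice
        value_sets[n].discard(a)
        tuples = [t for t in tuples if t[n] != a]
    return best_sets, best_density

# A 2x2 block of heavy cells plus one light, unrelated cell.
R = [("u1", "i1", 5), ("u1", "i2", 5), ("u2", "i1", 5), ("u2", "i2", 5),
     ("u3", "i3", 1)]
sets, dens = find_single_block(R)
```

On this toy input the search first peels off `u3` and `i3`, and the densest snapshot is the 2×2 block with \(\rho _{ari} = 20/(4/2) = 10\).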

3.2 Efficient Implementation of M-Zoom

In this section, we discuss an efficient implementation of M-Zoom focusing on the greedy attribute value selection and the densest block selection.


Attribute Value Selection Using Min-Heaps. Finding the attribute value \(a_{i} \in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}_{n}\) that maximizes \(\rho (\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a_{i}), \varvec{\mathcal {R}})\) (line 7 of Algorithm 2) can be computationally very expensive if all possible attribute values (i.e., \(\bigcup _{n=1}^{N}\varvec{\mathcal {B}}_{n}\)) must be considered. However, due to Axiom 1, which the density measures considered are assumed to satisfy, the number of candidates is reduced to N if \(M_{\varvec{\mathcal {B}}(a_{i})}\) is known for each attribute value \(a_{i}\). Lemma 1 states this.

Lemma 1

If we remove a value of attribute \(A_{n}\) from \(\varvec{\mathcal {B}}_{n}\), removing \(a_{n}\in \varvec{\mathcal {B}}_{n}\) with minimum \(M_{\varvec{\mathcal {B}}(a_{n})}\) results in the highest density. Formally,

$$M_{\varvec{\mathcal {B}}(a'_{n})} \le M_{\varvec{\mathcal {B}}(a_{n})}, \forall a_{n}\in \varvec{\mathcal {B}}_{n} \Rightarrow \rho (\varvec{\mathcal {B}}- \varvec{\mathcal {B}}(a'_{n}), \varvec{\mathcal {R}}) \ge \rho (\varvec{\mathcal {B}}- \varvec{\mathcal {B}}(a_{n}), \varvec{\mathcal {R}}), \forall a_{n}\in \varvec{\mathcal {B}}_{n}. $$

Proof

Let \(\varvec{\mathcal {B}}'=\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a'_{n})\) and \(\varvec{\mathcal {B}}''=\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a_{n})\). Then, \(|\varvec{\mathcal {B}}'_{n}|=|\varvec{\mathcal {B}}''_{n}|, \forall n\in [N]\). In addition, \(M_{\varvec{\mathcal {B}}'} \ge M_{\varvec{\mathcal {B}}''}\) since \(M_{\varvec{\mathcal {B}}'}=M_{\varvec{\mathcal {B}}} - M_{\varvec{\mathcal {B}}(a'_{n})} \ge M_{\varvec{\mathcal {B}}} - M_{\varvec{\mathcal {B}}(a_{n})}=M_{\varvec{\mathcal {B}}''}\). Hence, by Axiom 1, \(\rho (\varvec{\mathcal {B}}- \varvec{\mathcal {B}}(a'_{n}),\varvec{\mathcal {R}}) \ge \rho (\varvec{\mathcal {B}}- \varvec{\mathcal {B}}(a_{n}),\varvec{\mathcal {R}})\). \(\quad \square \)

By Lemma 1, if we let \(a'_{n}\) be \(a_{n}\in \varvec{\mathcal {B}}_{n}\) with minimum \(M_{\varvec{\mathcal {B}}(a_{n})}\), we only have to consider values in \(\{a'_{n}\}_{n=1}^{N}\) instead of \(\bigcup _{n=1}^{N}\varvec{\mathcal {B}}_{n}\) to find the attribute value maximizing density when it is removed. To exploit this, our implementation of M-Zoom maintains a min-heap for each attribute \(A_{n}\) where the key of each value \(a_{n}\) is \(M_{\varvec{\mathcal {B}}(a_{n})}\). Whenever tuples with the corresponding attribute value are removed, this key is decreased, which takes amortized O(1) time if Fibonacci heaps are used as min-heaps. Algorithm 3 describes in detail how to find the attribute value to be removed based on these min-heaps, and how to update the keys in them. Since Algorithm 3 considers all promising attribute values (i.e., \(\{a'_{n}\}_{n=1}^{N}\)), it is guaranteed to find the value that maximizes density when it is removed, as Theorem 1 states.

Theorem 1

Algorithm 3 returns \(a_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}_{n}\) with maximum \(\rho (\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a_{i}), \varvec{\mathcal {R}})\).

Proof

Let \(a_{i}^{*}\) be \(a_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}_{n}\) with maximum \(\rho (\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a_{i}), \varvec{\mathcal {R}})\). By Lemma 1, \(a_{i}^{*}\) exists among \(\{a'_{n}\}_{n=1}^{N}\), all of which are considered in Algorithm 3. \(\quad \square \)
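The per-attribute min-heaps can be sketched with Python's `heapq` module. Since `heapq` offers no decrease-key operation, this sketch swaps in lazy invalidation (pushing a fresh entry and skipping stale ones), so a key update costs O(log n) rather than the amortized O(1) of a Fibonacci heap. The class and function names are hypothetical, and \(\rho _{ari}\) is used as the density measure.

```python
import heapq

def rho_ari(m_b, sizes_b):
    """Arithmetic average mass; an empty block is given density 0 (assumption)."""
    total = sum(sizes_b)
    return 0.0 if total == 0 else m_b * len(sizes_b) / total

class AttributeHeaps:
    """One min-heap per attribute A_n, keyed by the mass M_B(a_n) of each value."""

    def __init__(self, relation):
        self.n_dims = len(relation[0]) - 1
        self.mass_of = [dict() for _ in range(self.n_dims)]
        for t in relation:
            for n in range(self.n_dims):
                self.mass_of[n][t[n]] = self.mass_of[n].get(t[n], 0) + t[-1]
        self.heaps = [[(m, a) for a, m in self.mass_of[n].items()]
                      for n in range(self.n_dims)]
        for heap in self.heaps:
            heapq.heapify(heap)

    def peek_min(self, n):
        """Return (value, mass) with minimum mass in mode n, or None if empty."""
        heap = self.heaps[n]
        while heap:
            m, a = heap[0]
            if self.mass_of[n].get(a) == m:
                return a, m
            heapq.heappop(heap)          # stale entry: value removed or key changed
        return None

    def decrease(self, n, a, delta):
        """Lower the key of value a in mode n by delta (lazy decrease-key)."""
        self.mass_of[n][a] -= delta
        heapq.heappush(self.heaps[n], (self.mass_of[n][a], a))

    def remove(self, n, a):
        """Drop value a from mode n; its heap entries become stale."""
        del self.mass_of[n][a]

def select_value(heaps, m_b, sizes, density=rho_ari):
    """Compare only the N per-mode minima, as Lemma 1 permits."""
    best = None
    for n in range(heaps.n_dims):
        top = heaps.peek_min(n)
        if top is None:
            continue
        a, m_a = top
        new_sizes = sizes[:n] + [sizes[n] - 1] + sizes[n + 1:]
        d = density(m_b - m_a, new_sizes)
        if best is None or d > best[0]:
            best = (d, n, a)
    return best

R = [("u1", "i1", 5), ("u1", "i2", 5), ("u2", "i1", 5), ("u2", "i2", 5),
     ("u3", "i3", 1)]
heaps = AttributeHeaps(R)
first = select_value(heaps, m_b=21, sizes=[3, 3])
```

Here `first` names `u3` in mode 0, whose removal raises \(\rho _{ari}\) from 7 to 8. After removing it, a caller would invoke `heaps.remove(0, "u3")` and `heaps.decrease(1, "i3", 1)` for the tuple that disappears.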

Densest Block Selection Using Attribute Value Ordering. As explained in Sect. 3.1, M-Zoom returns the densest block among snapshots of \(\varvec{\mathcal {B}}\) (line 10 of Algorithm 2). Explicitly maintaining the list of snapshots, whose length is at most \(S_{\varvec{\mathcal {R}}}\), requires \(O(N|\varvec{\mathcal {R}}|S_{\varvec{\mathcal {R}}})\) computation and space for copying them. Even maintaining only the current best (i.e., the one with the highest density so far) cannot avoid high computational cost if the current best keeps changing. Instead, our implementation maintains the order by which attribute values are removed as well as the iteration where the density was maximized, which requires only \(O(S_{\varvec{\mathcal {R}}})\) space. From these and the original relation \(\varvec{\mathcal {R}}\), our implementation restores the snapshot with maximum density in \(O(N|\varvec{\mathcal {R}}|+S_{\varvec{\mathcal {R}}})\) time and returns it.
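One way to realize this restoration step is sketched below. It assumes that the removal order is recorded as a list of (mode index, attribute value) pairs and that `best_iteration` counts the removals performed before the densest snapshot was taken; both conventions and the function name are illustrative.

```python
def restore_block(relation, removal_order, best_iteration):
    """Rebuild the densest snapshot from the original relation.

    removal_order  -- (mode index, attribute value) pairs in removal order
    best_iteration -- number of removals done before the densest snapshot
    """
    n_dims = len(relation[0]) - 1
    removed_early = [set() for _ in range(n_dims)]
    # O(S_R): mark the values that were already gone at the best iteration.
    for n, a in removal_order[:best_iteration]:
        removed_early[n].add(a)
    # O(N |R|): keep tuples none of whose attribute values were removed.
    return [t for t in relation
            if all(t[n] not in removed_early[n] for n in range(n_dims))]

R = [("u1", "i1", 5), ("u1", "i2", 5), ("u2", "i1", 5), ("u2", "i2", 5),
     ("u3", "i3", 1)]
order = [(0, "u3"), (1, "i3"), (0, "u1"), (1, "i1"), (0, "u2"), (1, "i2")]
densest = restore_block(R, order, best_iteration=2)
```

Only the removal order and a single integer are stored during the search, and the block is materialized once at the end, which is what keeps the extra space at \(O(S_{\varvec{\mathcal {R}}})\).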

3.3 Complexity Analysis

The time and space complexity of M-Zoom depend on the density measure used. In this section, we assume that one of the density measures in Sect. 2.2, which satisfy Axiom 1, is used.

Theorem 2

The time complexity of Algorithm 1 is \(O(kN|\varvec{\mathcal {R}}|\log L)\) if \(|\varvec{\mathcal {R}}_{n}|=L\), \(\forall n\in [N]\), and \(N = O(\log L)\).

Proof

See Appendix B.

As stated in Theorem 2, M-Zoom scales linearly or sub-linearly with all aspects of relation \(\varvec{\mathcal {R}}\) as well as k, the number of blocks we aim to find. This result is also experimentally supported in Sect. 4.4. In our experiments, the actual running time scaled sub-linearly with k as well as L since the number of tuples in \(\varvec{\mathcal {R}}\) decreases as M-Zoom finds blocks (line 5 in Algorithm 1).

Theorem 3

The space complexity of Algorithm 1 is \(O(kN|\varvec{\mathcal {R}}|)\).

Proof

See the supplementary document [1]. \(\quad \square \)

M-Zoom requires up to \(kN|\varvec{\mathcal {R}}|\) space for storing k found blocks, as stated in Theorem 3. However, since the blocks are usually far smaller than \(\varvec{\mathcal {R}}\), as seen in Tables 4 and 5 in Sect. 4, actual space usage is much less than \(kN|\varvec{\mathcal {R}}|\).

3.4 Accuracy Guarantee

In this section, we show lower bounds on the densities of the blocks found by M-Zoom on the assumption that \(\rho _{ari}\) (Definition 1) is used as the density measure. Specifically, we show that Algorithm 2 without size bounds is guaranteed to find a block with density at least 1/N of maximum density in the given relation (Theorem 4). This means that each n-th block returned by Algorithm 1 has density at least 1/N of maximum density in \(\varvec{\mathcal {R}}-\bigcup _{i=1}^{n-1}(i\text {-th block})\). Let \(\varvec{\mathcal {B}}^{(r)}\) be the relation \(\varvec{\mathcal {B}}\) at the beginning of the r-th iteration of Algorithm 2, and \(a_{i}^{(r)} \in \varvec{\mathcal {B}}^{(r)}_{i}\) be the attribute value removed in the same iteration.

Lemma 2

If a block \(\varvec{\mathcal {B}}'\) satisfying \(\forall a_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}'_{n}\), \(M_{\varvec{\mathcal {B}}'(a_{i})}\ge c\) exists, there exists \(\varvec{\mathcal {B}}^{(r)}\) satisfying \(\forall a_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}^{(r)}_{n}\), \(M_{\varvec{\mathcal {B}}^{(r)}(a_{i})}\ge c\).

Proof

See Appendix C. \(\quad \square \)

Theorem 4

(1/N -Approximation Guarantee for Problem 1 ). Given a relation \(\varvec{\mathcal {R}}\), let \(\varvec{\mathcal {B}}^{*}\) be the block \(\varvec{\mathcal {B}}\subset \varvec{\mathcal {R}}\) with maximum \(\rho _{ari}(\varvec{\mathcal {B}},\varvec{\mathcal {R}})\). Let \(\varvec{\mathcal {B}}'\) be the block obtained by Algorithm 2 without size bounds (i.e., \(S_{min}=0\) and \(S_{max}=\infty \)). Then, \(\rho _{ari}(\varvec{\mathcal {B}}',\varvec{\mathcal {R}}) \ge \rho _{ari}(\varvec{\mathcal {B}}^{*},\varvec{\mathcal {R}})/N\).

Proof

\(\forall a_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}^{*}_{n}\), \(M_{\varvec{\mathcal {B}}^{*}(a_{i})}\) \(\ge M_{\varvec{\mathcal {B}}^{*}}/S_{\varvec{\mathcal {B}}^{*}}\). Otherwise, a contradiction would result since for \(a_{i}\) with \(M_{\varvec{\mathcal {B}}^{*}(a_{i})}< M_{\varvec{\mathcal {B}}^{*}}/S_{\varvec{\mathcal {B}}^{*}}\),

$$ \rho _{ari}(\varvec{\mathcal {B}}^{*}-\varvec{\mathcal {B}}^{*}(a_{i}),\varvec{\mathcal {R}}) = \frac{M_{\varvec{\mathcal {B}}^{*}}-M_{\varvec{\mathcal {B}}^{*}(a_{i})}}{(S_{\varvec{\mathcal {B}}^{*}}-1)/N}> \frac{M_{\varvec{\mathcal {B}}^{*}}-M_{\varvec{\mathcal {B}}^{*}}/S_{\varvec{\mathcal {B}}^{*}}}{(S_{\varvec{\mathcal {B}}^{*}}-1)/N}=\rho _{ari}(\varvec{\mathcal {B}}^{*},\varvec{\mathcal {R}}). $$

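Consider \(\varvec{\mathcal {B}}^{(r)}\) where \(\forall a_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}^{(r)}_{n}\), \(M_{\varvec{\mathcal {B}}^{(r)}(a_{i})}\ge M_{\varvec{\mathcal {B}}^{*}}/S_{\varvec{\mathcal {B}}^{*}}\). Such \(\varvec{\mathcal {B}}^{(r)}\) exists by Lemma 2. Since the N cardinalities \(|\varvec{\mathcal {B}}^{(r)}_{n}|\) sum to \(S_{\varvec{\mathcal {B}}^{(r)}}\), some attribute \(A_{n}\) has at least \(S_{\varvec{\mathcal {B}}^{(r)}}/N\) values in \(\varvec{\mathcal {B}}^{(r)}_{n}\), and since each tuple has exactly one value of \(A_{n}\), \(M_{\varvec{\mathcal {B}}^{(r)}}=\sum _{a_{n}\in \varvec{\mathcal {B}}^{(r)}_{n}}M_{\varvec{\mathcal {B}}^{(r)}(a_{n})}\). Therefore, \(M_{\varvec{\mathcal {B}}^{(r)}}\ge (S_{\varvec{\mathcal {B}}^{(r)}}/N)\) \((M_{\varvec{\mathcal {B}}^{*}}/S_{\varvec{\mathcal {B}}^{*}})=(S_{\varvec{\mathcal {B}}^{(r)}}/N)(\rho _{ari}(\varvec{\mathcal {B}}^{*},\varvec{\mathcal {R}})/N)\). Hence, because Algorithm 2 returns the densest snapshot, \(\rho _{ari}(\varvec{\mathcal {B}}',\varvec{\mathcal {R}}) \ge \rho _{ari}(\varvec{\mathcal {B}}^{(r)},\varvec{\mathcal {R}}) = M_{\varvec{\mathcal {B}}^{(r)}}/(S_{\varvec{\mathcal {B}}^{(r)}}/N) \ge \rho _{ari}(\varvec{\mathcal {B}}^{*},\varvec{\mathcal {R}})/N.\) \(\quad \square \)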

Theorem 4 can be extended to cases where a lower size bound is given. In these cases, the approximation factor is 1/(N+1), as stated in Theorem 5.

Theorem 5

( \(1/(N+1)\) -Approximation Guarantee for Problem 2 ). Given a relation \(\varvec{\mathcal {R}}\), let \(\varvec{\mathcal {B}}^{*}\) be the block \(\varvec{\mathcal {B}}\subset \varvec{\mathcal {R}}\) with maximum \(\rho _{ari}(\varvec{\mathcal {B}},\varvec{\mathcal {R}})\) among blocks with size at least \(S_{min}\). Let \(\varvec{\mathcal {B}}'\) be the block obtained by Algorithm 2 with lower size bound (i.e., \(1 \le S_{min}\le S_{\varvec{\mathcal {R}}}\) and \(S_{max}=\infty \)). Then, \(\rho _{ari}(\varvec{\mathcal {B}}',\varvec{\mathcal {R}}) \ge \rho _{ari}(\varvec{\mathcal {B}}^{*},\varvec{\mathcal {R}})/(N+1)\).

Proof

See the supplementary document [1]. \(\quad \square \)

4 Experiments

We designed and performed experiments to answer the following questions:

  • Q1. How fast and accurately does M-Zoom detect dense blocks in real data?

  • Q2. Does M-Zoom find many different dense blocks in real data?

  • Q3. Does M-Zoom scale linearly with all aspects of data?

  • Q4. Which anomalies or fraud does M-Zoom spot in real data?

Table 3. Summary of real-world datasets.

4.1 Experimental Settings

All experiments were conducted on a machine with 2.67 GHz Intel Xeon E7-8837 CPUs and 1TB RAM. We compared M-Zoom with CrossSpot [12], CP Decomposition (CPD) [17] (see Appendix A for details), and MultiAspectForensics (MAF) [19]. M-Zoom and CrossSpot were implemented in Java, and Tensor Toolbox [4] was used for CPD and MAF. Although CrossSpot was originally designed to maximize \(\rho _{susp}\), it can be extended to other density measures. These variants were used depending on the density measure compared in each experiment. In addition, we used CPD as the seed selection method of CrossSpot, which outperformed the HOSVD used in [12] in terms of both speed and accuracy. We used diverse real-world datasets, grouped as follows:

  • User behavior logs: StackO.(user,post,timestamp,1) represents who marked which post as a favorite when on Stack Overflow. Youtube(user,user,date,1) represents who became a friend of whom when on Youtube. KoWiki(user,page,timestamp,#revisions) and EnWiki(user,page,timestamp,#revisions) represent who revised which page when how many times on Korean Wikipedia and English Wikipedia, respectively.

  • User reviews: Yelp(user,business,date,score,1), Netflix(user,movie,date,score,1), and YahooM.(user,item,timestamp,score,1) represent who gave which score when to which business, movie, and item on Yelp, Netflix, and Yahoo Music, respectively.

  • TCP dumps: From TCP dump data for a typical U.S. Air Force LAN, we created a relation AirForce(protocol,service,src_bytes,dst_bytes,flag,host_count,src_count,#connections). See the supplementary document [1] for the description of each attribute.

Timestamps are in hours in all the datasets. Table 3 summarizes all the datasets.

Fig. 2. Only M-Zoom achieves both speed and accuracy. In each plot, points represent the speed of different methods and the highest density (\(\rho _{ari}\)) of three blocks found by the methods. Upper-left region indicates better performance. M-Zoom gives the best trade-off between speed and density. Specifically, M-Zoom is up to 114 \(\times \) faster than CrossSpot with similarly dense blocks.

4.2 Q1. Running Time and Accuracy of M-Zoom

We compare the speed of different methods and the densities of the blocks found by the methods in real-world datasets. Specifically, we measured time taken to find three blocks and the maximum density among the three blocks. Figure 2 shows the result when \(\rho _{ari}\) was used as the density measure. M-Zoom clearly provided the best trade-off between speed and accuracy in all datasets. For example, in YahooM. Dataset, M-Zoom was 114 times faster than CrossSpot, while detecting blocks with similar densities. Compared with CPD, M-Zoom detected two times denser blocks 2.8 times faster. Although the results are not included in Fig. 2, MAF found several orders of magnitude sparser blocks than the other methods, with speed similar to that of CPD. M-Zoom also gave the best trade-off between speed and accuracy when \(\rho _{geo}\) or \(\rho _{susp}\) was used instead of \(\rho _{ari}\) (see the supplementary document [1]).

Fig. 3. M-Zoom detects many different dense blocks. The dense blocks found by M-Zoom and CPD have high diversity, while the dense blocks found by CrossSpot tend to be almost the same.

4.3 Q2. Diversity of Blocks Found by M-Zoom

We compare the diversity of dense blocks found by each method. Ability to detect many different dense blocks is useful since distinct blocks may indicate different anomalies or fraud. We define the diversity as the average dissimilarity between the pairs of blocks, and the dissimilarity of two blocks is defined as \(dissimilarity(\varvec{\mathcal {B}}, \varvec{\mathcal {B}}') = 1-\frac{ |(\bigcup {}_{n=1}^{N}{\varvec{\mathcal {B}}_{n}}) \cap (\bigcup {}_{n=1}^{N}{\varvec{\mathcal {B}}'_{n}})|}{|(\bigcup {}_{n=1}^{N}{\varvec{\mathcal {B}}_{n}}) \cup (\bigcup {}_{n=1}^{N}{\varvec{\mathcal {B}}'_{n}})|}.\) Diversities were measured among three blocks found by each method using \(\rho _{ari}\) as the density metric.
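The diversity measure above is one minus the Jaccard similarity of the blocks' attribute-value sets, averaged over all pairs of blocks. A small sketch follows; it represents a block by its list of value sets \([\varvec{\mathcal {B}}_{1},...,\varvec{\mathcal {B}}_{N}]\) and, as an added assumption, tags each value with its mode index so that equal labels in different attributes are not conflated.

```python
from itertools import combinations

def tagged_values(value_sets):
    """Union of B_1..B_N, with each value tagged by its mode index."""
    return {(n, a) for n, values in enumerate(value_sets) for a in values}

def dissimilarity(block_a, block_b):
    """1 - |intersection| / |union| of the two blocks' attribute values."""
    va, vb = tagged_values(block_a), tagged_values(block_b)
    union = va | vb
    if not union:
        return 0.0          # assumed convention for two empty blocks
    return 1 - len(va & vb) / len(union)

def diversity(blocks):
    """Average dissimilarity over all unordered pairs of blocks."""
    pairs = list(combinations(blocks, 2))
    if not pairs:
        return 0.0          # assumed convention for fewer than two blocks
    return sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)

B1 = [{"u1", "u2"}, {"i1", "i2"}]
B2 = [{"u2", "u3"}, {"i2"}]
B3 = [{"u9"}, {"i9"}]
d12 = dissimilarity(B1, B2)      # shares u2 and i2: 1 - 2/5
div = diversity([B1, B2, B3])    # (d12 + 1 + 1) / 3
```

A diversity of 0 means every method run returned the same block, while a diversity of 1 means no two blocks share any attribute value.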

As seen in Fig. 3, in all datasets, M-Zoom and CPD successfully detected distinct dense blocks. CrossSpot, however, found the same block repeatedly or blocks with only slight differences, even when it started from different seed blocks. Although using CPD for seed-block selection in CrossSpot improved the diversity, the effect was limited in most datasets. Similar results were obtained when \(\rho _{geo}\) or \(\rho _{susp}\) was used instead of \(\rho _{ari}\) (see the supplementary document [1]).

Fig. 4. M-Zoom is scalable. (a) (b) M-Zoom scales linearly with the number of tuples and the number of attributes. (c) (d) M-Zoom scales sub-linearly with the cardinalities of attributes and the number of blocks we aim to find.

4.4 Q3. Scalability of M-Zoom

We empirically demonstrate the scalability of M-Zoom, mathematically analyzed in Theorem 2. Specifically, we measured the scalability of M-Zoom with regard to the number of tuples, the number of attributes, the cardinalities of attributes, and the number of blocks we aim to find. We started with finding one block in a relation of 10 million randomly generated tuples with three attributes, each with cardinality 100K. Then, we measured the running time by changing one factor at a time while fixing the others. As seen in Fig. 4, M-Zoom scaled linearly with the number of tuples and the number of attributes. Moreover, M-Zoom scaled sub-linearly with the number of blocks we aim to find as well as the cardinalities of attributes, for the reason explained in Sect. 3.3. These results held regardless of the density measure used.

4.5 Q4. Anomaly/Fraud Detection by M-Zoom in Real Data

We demonstrate the effectiveness of M-Zoom for anomaly and fraud detection by analyzing dense blocks detected by M-Zoom in real-world datasets.

M-Zoom spots edit wars and bot activities in Wikipedia. Table 4 lists the first three dense blocks found by M-Zoom in EnWiki and KoWiki Datasets. As seen in the third dense block visualized in Fig. 1c, the dense blocks detected in KoWiki Dataset indicate edit wars. That is, users with conflicting opinions revised the same set of pages hundreds of times within several hours. On the other hand, the dense blocks detected in EnWiki Dataset indicate the activities of bots, which changed the same pages hundreds of thousands of times. Figure 1d lists the bots and pages corresponding to the second found block.

Table 4. M-Zoom detects anomalous behaviors in Wikipedia. The tables list the first three blocks detected by M-Zoom in KoWiki and EnWiki Datasets, which correspond to edit wars and bot activities, respectively.
Table 5. M-Zoom identifies network attacks with near-perfect accuracy. The first three blocks found by M-Zoom in AirForce Dataset consist of attacks.

M-Zoom spots network intrusions. Table 5 lists the first three blocks found by M-Zoom in AirForce Dataset. Based on the provided ground truth labels, all of the approximately 3 million connections composing the blocks were attacks, except for a single normal connection. This indicates that malicious connections form dense blocks due to the similarity in their behaviors. Based on this observation, we could accurately separate normal connections and attacks based on the densities of the blocks they belong to (i.e., the denser the block a connection belongs to, the more suspicious it is). In particular, we obtained the highest AUC (area under the curve), 0.98, with M-Zoom, as shown in Fig. 1b, because M-Zoom detects many different dense blocks accurately, as shown in previous experiments. For each method, we used the density measure that led to the highest AUC.

5 Related Work

Dense Subgraph/Submatrix/Subtensor Detection. The densest subgraph problem, the problem of finding the subgraph which maximizes \(\rho _{ari}\) or \(\rho _{geo}\) (see Definitions 1 and 2), has been extensively studied in theory (see [18] for surveys). The two major directions are max-flow based exact algorithms [9, 16] and greedy algorithms [7, 16] giving a 1/2-approximation to the densest subgraph. Variants allow for size restrictions [3], providing a 1/3-approximation to the densest subgraph for the lower bound case. Another related line of research deals with dense blocks in binary matrices or tensors where the definition of density is designed for the purpose of frequent itemset mining [22] or formal concept mining [6, 11].

Anomaly/Fraud Detection based on Dense Subgraphs. Spectral approaches make use of eigendecomposition or SVD of the adjacency matrix for dense-block detection. Such approaches have been used to spot anomalous patterns in a patent graph [21], lockstep followers in a social network [14], and stealthy or small-scale attacks in social networks [23]. Other approaches include NetProbe [20], which used belief propagation to detect fraud-accomplice bipartite cores in an auction network, and CopyCatch [5], which used one-class clustering and sub-space clustering to identify “Like” boosting in Facebook. In addition, OddBall [2] spotted near-cliques in links among posts in blogs based on egonet features. Recently, Fraudar [10], which generalizes densest subgraph-detection methods so that the suspiciousness of nodes and edges can be incorporated, spotted follower-buying services in Twitter.

Anomaly/Fraud Detection based on Dense Subtensors. Spectral methods for dense subgraphs can be extended to tensors where tensor decomposition, such as CP Decomposition and HOSVD [17], is used to spot dense subtensors. MAF [19], which is based on CP Decomposition, detected dense blocks corresponding to port-scanning activities based on network traffic logs. Another approach is CrossSpot [12], which finds dense blocks by starting from seed blocks and growing them in a greedy way until \(\rho _{susp}\) (see Definition 3) converges. CrossSpot spotted retweet boosting in Weibo, outperforming HOSVD.

Our M-Zoom non-trivially generalizes theoretical results regarding the densest subgraph problem, especially [3], for supporting tensors, various density measures, and multi-block detection. As seen in Table 1, M-Zoom provides more flexibility than other methods for dense-block detection.

6 Conclusion

In this work, we propose M-Zoom, a flexible framework for finding dense blocks in tensors, which has the following advantages over state-of-the-art methods:

  • Scalable: M-Zoom is up to 114 \(\times \) faster than competitors with similar accuracy due to its linear scalability with all input factors (Figs. 2 and 4).

  • Provably accurate: M-Zoom provides lower bounds on the densities of the blocks it finds (Theorem 4) as well as high accuracy in real data (Fig. 2).

  • Flexible: M-Zoom supports high-order tensors, various density measures, multi-block detection, and size bounds (Table 1).

  • Effective: M-Zoom successfully detected fraud based on a TCP dump with near-perfect accuracy (AUC = 0.98), and anomalies in Wikipedia (Fig. 1).

Reproducibility: Our open-sourced code and the data we used are at http://www.cs.cmu.edu/~kijungs/codes/mzoom.