M-Zoom: Fast Dense-Block Detection in Tensors with Quality Guarantees

Shin, Kijung; Hooi, Bryan; Faloutsos, Christos

doi:10.1007/978-3-319-46128-1_17


Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 9851))

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases


Abstract

Given a large-scale and high-order tensor, how can we find dense blocks in it? Can we find them in near-linear time but with a quality guarantee? Extensive previous work has shown that dense blocks in tensors as well as graphs indicate anomalous or fraudulent behavior (e.g., lockstep behavior in social networks). However, available methods for detecting such dense blocks are not satisfactory in terms of speed, accuracy, or flexibility. In this work, we propose M-Zoom, a flexible framework for finding dense blocks in tensors, which works with a broad class of density measures. M-Zoom has the following properties: (1) Scalable: M-Zoom scales linearly with all aspects of tensors and is up to 114 $\times $ faster than state-of-the-art methods with similar accuracy. (2) Provably accurate: M-Zoom provides a guarantee on the lowest density of the blocks it finds. (3) Flexible: M-Zoom supports multi-block detection and size bounds as well as diverse density measures. (4) Effective: M-Zoom successfully detected edit wars and bot activities in Wikipedia, and spotted network attacks from a TCP dump with near-perfect accuracy (AUC = 0.98). The data and software related to this paper are available at http://www.cs.cmu.edu/~kijungs/codes/mzoom/.


1 Introduction

Imagine that you manage a social review site (e.g., Yelp) and have the records of which accounts wrote reviews for which restaurants. How do you detect suspicious lockstep behavior: for example, a set of accounts which give fake reviews to the same set of restaurants? What about the case where additional information is present, such as the timestamp of each review, or the keywords in each review?

Such problems of detecting suspicious lockstep behavior have been extensively studied from the perspective of dense subgraph detection. Intuitively, in the above example, highly synchronized behavior induces dense subgraphs in the bipartite review graph of accounts and restaurants. Indeed, methods which detect dense subgraphs have been successfully used to spot fraud in settings ranging from social networks [5, 10, 13, 14] to auctions [20] and search engines [8].

Additional information helps identify suspicious lockstep behavior. In the above example, the fact that reviews forming a dense subgraph were also written at about the same time, with the same keywords and number of stars, makes the reviews even more suspicious. A natural and effective way to incorporate such extra information is to model data as a tensor and find dense blocks in it [12, 19].

However, neither existing methods for detecting dense blocks in tensors nor simple extensions of graph-based methods are satisfactory in terms of speed, accuracy, or flexibility. In particular, the types of fraud detectable by each method are limited since, explicitly or implicitly, each method is based on only one density metric, which determines how dense, and thus how suspicious, each block is.

Hence, in this work, we propose M-Zoom (Multidimensional Zoom), a general and flexible framework for detecting dense blocks in tensors. M-Zoom allows for a broad class of density metrics, in addition to having the following strengths:

Scalable: M-Zoom is up to 114 $\times $ faster than state-of-the-art methods with similar accuracy (Fig. 2) thanks to its linear scalability with all aspects of tensors (Fig. 4).
Provably accurate: M-Zoom provides a guarantee on the lowest density of the blocks it finds (Theorem 4), and shows accuracy similar to that of state-of-the-art methods on real-world datasets (Fig. 1a).
Flexible: M-Zoom works successfully with high-order tensors and supports various density measures, multi-block detection, and size bounds (Table 1).
Effective: M-Zoom successfully detected edit wars and bot activities in Wikipedia (Figs. 1c and d), and also detected network attacks with near-perfect accuracy (AUC = 0.98) based on TCP dump data (Fig. 1b).

Reproducibility: Our open-sourced code and the data we used are available at http://www.cs.cmu.edu/~kijungs/codes/mzoom.

Section 2 presents preliminaries and problem definitions. Our proposed M-Zoom is described in Sect. 3 followed by experimental results in Sect. 4. After discussing related work in Sect. 5, we draw conclusions in Sect. 6.

Table 1. M-Zoom is flexible. Comparison between M-Zoom and other methods for dense-block detection. A mark in a cell represents ‘supported’.

2 Preliminaries and Problem Definition

In this section, we introduce definitions and notations used in the paper. We also discuss density measures and give a formal definition of our problems.

2.1 Definitions and Notations

Let $\varvec{\mathcal {R}}(A_{1},A_{2},...,A_{N},X)$ be a relation with N dimension attributes $A_{1}$, $A_{2}$, ..., $A_{N}$, and a nonnegative measure attribute X (see the supplementary document [1] for a running example and its pictorial description). We use $\varvec{\mathcal {R}}_{n}$ to denote the set of distinct values of $A_{n}$ in $\varvec{\mathcal {R}}$, and use $a_{n}\in \varvec{\mathcal {R}}_{n}$ for a value of $A_{n}$. The value of $A_{n}$ in tuple t is denoted by $t[A_{n}]$, and the value of X is denoted by t[X]. The relation $\varvec{\mathcal {R}}$ can be represented as an N-way tensor. In the tensor, each n-th mode has length $|\varvec{\mathcal {R}}_{n}|$, and each cell has the value of attribute X, if the corresponding tuple exists, and 0 otherwise. Let $\varvec{\mathcal {B}}_{n}$ be a subset of $\varvec{\mathcal {R}}_{n}$. Then, we define a block $\varvec{\mathcal {B}}(A_{1},A_{2},...,A_{N}, X)=\{t\in \varvec{\mathcal {R}}: 1 \le \forall n \le N, t[A_{n}]\in \varvec{\mathcal {B}}_{n}\}$, the set of tuples where each dimension attribute $A_{n}$ has a value in $\varvec{\mathcal {B}}_{n}$. $\varvec{\mathcal {B}}$ is called ‘block’ because it forms a subtensor where each n-th mode has length $|\varvec{\mathcal {B}}_{n}|$ in the tensor representation of $\varvec{\mathcal {R}}$. The set of tuples of $\varvec{\mathcal {R}}$ with attribute $A_{n}=a_{n}$ is denoted by $\varvec{\mathcal {R}}(a_{n})=\{t\in \varvec{\mathcal {R}}: t[A_{n}] = a_{n}\}$. We define the mass of $\varvec{\mathcal {R}}$ as $M_{\varvec{\mathcal {R}}}=Mass(\varvec{\mathcal {R}})=\sum _{t\in \varvec{\mathcal {R}}}t[X]$, the sum of the values of attribute X in $\varvec{\mathcal {R}}$. We also define the size of $\varvec{\mathcal {R}}$ as $S_{\varvec{\mathcal {R}}}=Size(\varvec{\mathcal {R}})=\sum _{n=1}^{N}|\varvec{\mathcal {R}}_{n}|$ and the volume of $\varvec{\mathcal {R}}$ as $V_{\varvec{\mathcal {R}}}=Volume(\varvec{\mathcal {R}})=\prod _{n=1}^{N}|\varvec{\mathcal {R}}_{n}|$. 
Lastly, we use $[x]=\{1,2,...,x\}$ for convenience. Table 2 lists frequently used symbols.
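The definitions above can be made concrete with a small sketch. The toy relation, the attribute names, and the helper names (`distinct_values`, `block`, `mass`, `size`, `volume`) are all assumptions introduced for illustration; they do not come from the paper or its released code. A relation is modeled as a list of tuples whose last entry is the measure attribute X.

```python
from math import prod

# Toy relation R(user, page, date, X); the data is invented for illustration.
# Each tuple is (a_1, ..., a_N, x), where x is the nonnegative measure value.
R = [
    ("alice", "p1", "d1", 3),
    ("alice", "p2", "d1", 1),
    ("bob",   "p1", "d1", 2),
    ("carol", "p3", "d2", 4),
]
N = 3  # number of dimension attributes


def distinct_values(rel, n_dims):
    """R_n for each n: the set of distinct values of attribute A_n."""
    return [{t[n] for t in rel} for n in range(n_dims)]


def block(rel, value_sets):
    """Tuples whose n-th attribute value lies in value_sets[n] for every n."""
    return [t for t in rel
            if all(t[n] in value_sets[n] for n in range(len(value_sets)))]


def mass(rel):
    """Sum of the measure attribute X over all tuples."""
    return sum(t[-1] for t in rel)


def size(value_sets):
    """Sum of the cardinalities |B_n|."""
    return sum(len(s) for s in value_sets)


def volume(value_sets):
    """Product of the cardinalities |B_n|."""
    return prod(len(s) for s in value_sets)


R_sets = distinct_values(R, N)
B_sets = [{"alice", "bob"}, {"p1", "p2"}, {"d1"}]
B = block(R, B_sets)
print(mass(R), size(R_sets), volume(R_sets))  # 10 8 18
print(mass(B), size(B_sets), volume(B_sets))  # 6 5 4
```

Size and volume are computed from the value sets rather than from the tuples, because a block is defined by its sets $\varvec{\mathcal {B}}_{n}$ and may contain cells with no tuple.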

Table 2. Table of symbols.


2.2 Density Measures

In this paper, we consider three specific density measures although our method is not restricted to them. Two of the density measures (Definitions 1 and 2) are natural multi-dimensional extensions of classic density measures which have been widely used for subgraphs. The merits of the original measures are discussed in [7, 15], and extensive research based on them is discussed in Sect. 5.

Definition 1

(Arithmetic Average Mass [7]). The arithmetic average mass of a block ${\varvec{\mathcal {B}}}$ of a relation ${\varvec{\mathcal {R}}}$ is defined as $\rho _{ari}(\varvec{\mathcal {B}},\varvec{\mathcal {R}})=M_{\varvec{\mathcal {B}}}/(S_{\varvec{\mathcal {B}}}/N)$.

Definition 2

(Geometric Average Mass [7]). The geometric average mass of a block $\varvec{\mathcal {B}}$ of a relation $\varvec{\mathcal {R}}$ is defined as $\rho _{geo}(\varvec{\mathcal {B}},\varvec{\mathcal {R}})=M_{\varvec{\mathcal {B}}}/V_{\varvec{\mathcal {B}}}^{(1/N)}$.

The other density measure (Definition 3) is the negative log likelihood of $M_{\varvec{\mathcal {B}}}$ on the assumption that the value on each cell (in the tensor representation) of $\varvec{\mathcal {R}}$ follows a Poisson distribution. This proved useful in fraud detection [12].

Definition 3

(Suspiciousness [12]). The suspiciousness of a block $\varvec{\mathcal {B}}$ of a relation $\varvec{\mathcal {R}}$ is defined as $\rho _{susp}(\varvec{\mathcal {B}},\varvec{\mathcal {R}})=M_{\varvec{\mathcal {B}}}(\log (M_{\varvec{\mathcal {B}}}/M_{\varvec{\mathcal {R}}})-1)+M_{\varvec{\mathcal {R}}}V_{\varvec{\mathcal {B}}}/V_{\varvec{\mathcal {R}}}-M_{\varvec{\mathcal {B}}}\log (V_{\varvec{\mathcal {B}}}/V_{\varvec{\mathcal {R}}})$.
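As a minimal sketch, the three measures in Definitions 1 to 3 can be written as functions of the block mass and the per-attribute cardinalities. The function signatures are assumptions made for this example, the natural logarithm is assumed for $\rho _{susp}$, and the functions expect a non-empty block with positive mass.

```python
import math


def rho_ari(m_b, cards):
    """Arithmetic average mass: M_B / (S_B / N), with cards[n] = |B_n|."""
    n = len(cards)
    return m_b / (sum(cards) / n)


def rho_geo(m_b, cards):
    """Geometric average mass: M_B / V_B^(1/N)."""
    n = len(cards)
    return m_b / (math.prod(cards) ** (1.0 / n))


def rho_susp(m_b, cards, m_r, cards_r):
    """Suspiciousness of a block B of relation R (natural log assumed)."""
    v_ratio = math.prod(cards) / math.prod(cards_r)  # V_B / V_R
    return (m_b * (math.log(m_b / m_r) - 1)
            + m_r * v_ratio
            - m_b * math.log(v_ratio))


# A block of mass 6 with cardinalities (2, 2, 1) inside a relation of
# mass 10 with cardinalities (3, 3, 2); the numbers are illustrative only.
print(rho_ari(6, [2, 2, 1]))
print(rho_geo(6, [2, 2, 1]))
print(rho_susp(6, [2, 2, 1], 10, [3, 3, 2]))
```

One sanity check follows directly from Definition 3: when the block is the whole relation, the suspiciousness is $M_{\varvec{\mathcal {R}}}(0-1)+M_{\varvec{\mathcal {R}}}-0=0$.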

Our method, however, is not restricted to the three measures mentioned above. Our method, which searches for dense blocks in a tensor, allows for any density measure $\rho $ that satisfies Axiom 1.

Axiom 1

(Density Axiom). If two blocks of a relation have the same cardinality for every dimension attribute, the block with higher or equal mass is at least as dense as the other. Formally,

$$M_{\varvec{\mathcal {B}}} \ge M_{\varvec{\mathcal {B}}'}\, { and }\, |\varvec{\mathcal {B}}_{n}|=|\varvec{\mathcal {B}}'_{n}|, \forall n\in [N] \Rightarrow \rho (\varvec{\mathcal {B}},\varvec{\mathcal {R}}) \ge \rho (\varvec{\mathcal {B}}',\varvec{\mathcal {R}}).$$

2.3 Problem Definition

We formally define the problem of detecting the k densest blocks in a tensor.

Problem 1 (k-Densest Blocks)

(1) Given: a relation $\varvec{\mathcal {R}}$, the number of blocks k, and a density measure $\rho $, (2) Find: k distinct blocks of $\varvec{\mathcal {R}}$ with the highest densities in terms of $\rho $.

We also consider a variant of Problem 1 which incorporates lower and upper bounds on the size of the detected blocks. This is particularly useful if the unrestricted densest block is not meaningful due to being too small (e.g. a single tuple) or too large (e.g. the entire tensor).

Problem 2 (k-Densest Blocks with Size Bounds)

(1) Given: a relation $\varvec{\mathcal {R}}$, the number of blocks k, a density measure $\rho $, lower size bound $S_{min}$, and upper size bound $S_{max}$, (2) Find: k distinct blocks of $\varvec{\mathcal {R}}$ with the highest densities in terms of $\rho $ (3) Among: blocks whose sizes are at least $S_{min}$ and at most $S_{max}$.

Even when we restrict our attention to a special case (N=2, k=1, $\rho $=$\rho _{ari}$, $S_{min}$=$S_{max}$), exactly solving Problem 1 takes $O(S_{\varvec{\mathcal {R}}}^6)$ time [9] and exactly solving Problem 2 is NP-hard [3]; both are infeasible for large datasets. Thus, we focus on an approximation algorithm which (1) has linear scalability with all aspects of $\varvec{\mathcal {R}}$, (2) provides accuracy guarantees at least for some density measures, and (3) produces meaningful results in real-world datasets, as explained in detail in Sects. 3 and 4.

3 Proposed Method

In this section, we propose M-Zoom (Multidimensional Zoom), a scalable, accurate, and flexible method for finding dense blocks in a tensor. We present the details of M-Zoom in Sect. 3.1 and discuss its efficient implementation in Sect. 3.2. After analyzing the time and space complexity in Sect. 3.3, we prove the quality guarantees provided by M-Zoom in Sect. 3.4.

3.1 Algorithm

Algorithm 1 describes the outline of M-Zoom. M-Zoom first copies the given relation $\varvec{\mathcal {R}}$ and assigns it to $\varvec{\mathcal {R}}^{ori}$ (line 1). Then, M-Zoom finds k dense blocks one by one from $\varvec{\mathcal {R}}$ (line 4). After finding each block from $\varvec{\mathcal {R}}$, M-Zoom removes the tuples in the block from $\varvec{\mathcal {R}}$ to prevent the same block from being found again (line 5). Due to these changes in $\varvec{\mathcal {R}}$, a block found in $\varvec{\mathcal {R}}$ is not necessarily a block of the original relation $\varvec{\mathcal {R}}^{ori}$. Thus, instead of returning the blocks found in $\varvec{\mathcal {R}}$, M-Zoom returns the blocks of $\varvec{\mathcal {R}}^{ori}$ consisting of the same attribute values as the found blocks (lines 6–7). This also enables M-Zoom to find overlapping blocks, i.e., a tuple can be included in two or more blocks.
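The outer loop described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' Java implementation: `m_zoom`, `find_single_block`, and `heaviest_tuple` are hypothetical names, the single-block finder is passed in as a callable that returns one set of attribute values per dimension attribute, and the early exit on an empty relation is an addition of this sketch.

```python
def m_zoom(relation, k, find_single_block):
    """Find up to k blocks; each is returned as tuples of the ORIGINAL relation.

    relation: list of tuples (a_1, ..., a_N, x).
    find_single_block: hypothetical callable mapping a relation to a list of
        N sets of attribute values (one set per dimension attribute).
    """
    original = list(relation)   # R^ori, never modified
    remaining = list(relation)  # R, shrinks as blocks are found
    results = []
    for _ in range(k):
        if not remaining:       # assumption: stop early when R is empty
            break
        value_sets = find_single_block(remaining)

        def in_block(t, sets=value_sets):
            return all(t[n] in sets[n] for n in range(len(sets)))

        # Remove the found tuples so the same block is not found again.
        remaining = [t for t in remaining if not in_block(t)]
        # Report the block of R^ori that has the same attribute values.
        results.append([t for t in original if in_block(t)])
    return results


def heaviest_tuple(rel):
    """Stand-in finder for the demo: the block spanned by the heaviest tuple."""
    t = max(rel, key=lambda u: u[-1])
    return [{v} for v in t[:-1]]


R = [("u1", "p1", 5), ("u2", "p2", 3), ("u3", "p3", 1)]
print(m_zoom(R, 2, heaviest_tuple))
# [[('u1', 'p1', 5)], [('u2', 'p2', 3)]]
```

Because the reported blocks are rebuilt from `original`, a tuple removed in an earlier iteration can reappear in a later block, which is how overlapping blocks arise.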

Algorithm 2 describes how M-Zoom finds a single dense block from the given relation $\varvec{\mathcal {R}}$. The block $\varvec{\mathcal {B}}$ is initialized to $\varvec{\mathcal {R}}$ (lines 1–2). From $\varvec{\mathcal {B}}$, M-Zoom removes attribute values one by one in a greedy way until no attribute value is left (line 4). Specifically, M-Zoom finds the attribute value $a_{i}$ that maximizes $\rho (\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a_{i}), \varvec{\mathcal {R}})$, which corresponds to the density when tuples with $A_{i}=a_{i}$ are removed from $\varvec{\mathcal {B}}$ (line 7). Then, the attribute value, denoted by $a^{*}_{i}$, and the tuples with $A_{i}=a^{*}_{i}$ are removed from $\varvec{\mathcal {B}}_{i}$ and $\varvec{\mathcal {B}}$, respectively (lines 8–9). Before removing each attribute value, M-Zoom adds the current $\varvec{\mathcal {B}}$ to the snapshot list if $\varvec{\mathcal {B}}$ satisfies the size bound (i.e., $S_{min}\le S_{\varvec{\mathcal {B}}} \le S_{max}$) (lines 5–6). As the final step of finding a block, M-Zoom returns the block with the maximum density among those in the snapshot list (line 10).
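A naive version of this greedy search can be sketched as below. It evaluates every remaining attribute value at each step and copies each improving snapshot, so it does not have the efficiency of the implementation discussed in Sect. 3.2. The names are hypothetical, $\rho _{ari}$ is used with the assumption that an empty block has density 0, and ties are broken deterministically by sorting on `repr`, which is a choice of this sketch.

```python
def rho_ari(mass, cards):
    """Arithmetic average mass M_B / (S_B / N); taken as 0 for an empty block."""
    s = sum(cards)
    return mass * len(cards) / s if s > 0 else 0.0


def find_single_block(relation, n_dims, rho=rho_ari,
                      s_min=0, s_max=float("inf")):
    """Greedy single-block search; returns the N value sets of the best snapshot."""
    tuples = list(relation)
    value_sets = [{t[n] for t in tuples} for n in range(n_dims)]
    best_density, best_sets = float("-inf"), None
    while any(value_sets):  # until no attribute value is left
        cards = [len(s) for s in value_sets]
        mass = sum(t[-1] for t in tuples)
        # Snapshot the current block if it satisfies the size bounds.
        if s_min <= sum(cards) <= s_max:
            density = rho(mass, cards)
            if density > best_density:
                best_density = density
                best_sets = [set(s) for s in value_sets]
        # Pick the attribute value whose removal yields the highest density.
        choice, choice_density = None, float("-inf")
        for n in range(n_dims):
            for a in sorted(value_sets[n], key=repr):
                removed_mass = sum(t[-1] for t in tuples if t[n] == a)
                new_cards = cards.copy()
                new_cards[n] -= 1
                density = rho(mass - removed_mass, new_cards)
                if density > choice_density:
                    choice, choice_density = (n, a), density
        n, a = choice
        value_sets[n].remove(a)
        tuples = [t for t in tuples if t[n] != a]
    return best_sets


# A dense 2x2 block (value 5 per cell) plus two sparse tuples.
R = [("u1", "p1", 5), ("u1", "p2", 5), ("u2", "p1", 5), ("u2", "p2", 5),
     ("u3", "p3", 1), ("u4", "p4", 1)]
print(find_single_block(R, 2))
```

On this toy relation the search peels off the four low-mass values first and returns the 2x2 block, whose density under $\rho _{ari}$ is $20/(4/2)=10$.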

3.2 Efficient Implementation of M-Zoom

In this section, we discuss an efficient implementation of M-Zoom focusing on the greedy attribute value selection and the densest block selection.

Attribute Value Selection Using Min-Heaps. Finding the attribute value $a_{i} \in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}_{n}$ that maximizes $\rho (\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a_{i}), \varvec{\mathcal {R}})$ (line 7 of Algorithm 2) can be computationally very expensive if all possible attribute values (i.e., $\bigcup _{n=1}^{N}\varvec{\mathcal {B}}_{n}$) should be considered. However, due to Axiom 1, which is assumed to be satisfied by considered density measures, the number of candidates is reduced to N if $M_{\varvec{\mathcal {B}}(a_{i})}$ is known for each attribute value $a_{i}$. Lemma 1 states this.

Lemma 1

If we remove a value of attribute $A_{n}$ from $\varvec{\mathcal {B}}_{n}$, removing $a_{n}\in \varvec{\mathcal {B}}_{n}$ with minimum $M_{\varvec{\mathcal {B}}(a_{n})}$ results in the highest density. Formally,

$$M_{\varvec{\mathcal {B}}(a'_{n})} \le M_{\varvec{\mathcal {B}}(a_{n})}, \forall a_{n}\in \varvec{\mathcal {B}}_{n} \Rightarrow \rho (\varvec{\mathcal {B}}- \varvec{\mathcal {B}}(a'_{n}), \varvec{\mathcal {R}}) \ge \rho (\varvec{\mathcal {B}}- \varvec{\mathcal {B}}(a_{n}), \varvec{\mathcal {R}}), \forall a_{n}\in \varvec{\mathcal {B}}_{n}. $$

Proof

Let $\varvec{\mathcal {B}}'=\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a'_{n})$ and $\varvec{\mathcal {B}}''=\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a_{n})$. Then, $|\varvec{\mathcal {B}}'_{n}|=|\varvec{\mathcal {B}}''_{n}|, \forall n\in [N]$. In addition, $M_{\varvec{\mathcal {B}}'} \ge M_{\varvec{\mathcal {B}}''}$ since $M_{\varvec{\mathcal {B}}'}=M_{\varvec{\mathcal {B}}} - M_{\varvec{\mathcal {B}}(a'_{n})} \ge M_{\varvec{\mathcal {B}}} - M_{\varvec{\mathcal {B}}(a_{n})}=M_{\varvec{\mathcal {B}}''}$. Hence, by Axiom 1, $\rho (\varvec{\mathcal {B}}- \varvec{\mathcal {B}}(a'_{n}),\varvec{\mathcal {R}}) \ge \rho (\varvec{\mathcal {B}}- \varvec{\mathcal {B}}(a_{n}),\varvec{\mathcal {R}})$. $\quad \square $

By Lemma 1, if we let $a'_{n}$ be $a_{n}\in \varvec{\mathcal {B}}_{n}$ with minimum $M_{\varvec{\mathcal {B}}(a_{n})}$, we only have to consider values in $\{a'_{n}\}_{n=1}^{N}$ instead of $\bigcup _{n=1}^{N}\varvec{\mathcal {B}}_{n}$ to find the attribute value maximizing density when it is removed. To exploit this, our implementation of M-Zoom maintains a min-heap for each attribute $A_{n}$ where the key of each value $a_{n}$ is $M_{\varvec{\mathcal {B}}(a_{n})}$. This key is updated whenever tuples with the corresponding attribute value are removed, which takes O(1) amortized time if Fibonacci heaps are used as min-heaps. Algorithm 3 describes in detail how to find the attribute value to be removed based on these min-heaps, and how to update keys in them. Since Algorithm 3 considers all promising attribute values (i.e., $\{a'_{n}\}_{n=1}^{N}$), it is guaranteed to find the value that maximizes density when it is removed, as Theorem 1 states.

Theorem 1

Algorithm 3 returns $a_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}_{n}$ with maximum $\rho (\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a_{i}), \varvec{\mathcal {R}})$.

Proof

Let $a_{i}^{*}$ be $a_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}_{n}$ with maximum $\rho (\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a_{i}), \varvec{\mathcal {R}})$. By Lemma 1, $a_{i}^{*}$ exists among $\{a'_{n}\}_{n=1}^{N}$, all of which are considered in Algorithm 3. $\quad \square $
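The candidate reduction of Lemma 1 can be sketched with one min-heap per attribute. This sketch swaps in Python's binary heap (`heapq`) with lazy deletion of stale entries instead of the Fibonacci heaps mentioned above, so a key update costs O(log n) rather than O(1) amortized. The class and function names are hypothetical, values of one attribute are assumed to be mutually comparable, and `remove_value` scans all tuples for simplicity.

```python
import heapq


def rho_ari(mass, cards):
    """Arithmetic average mass; taken as 0 for an empty block."""
    s = sum(cards)
    return mass * len(cards) / s if s > 0 else 0.0


class MinMassHeaps:
    """One min-heap per attribute A_n, keyed by M_{B(a_n)}."""

    def __init__(self, relation, n_dims):
        self.mass = [{} for _ in range(n_dims)]  # current key of each value
        for t in relation:
            for n in range(n_dims):
                self.mass[n][t[n]] = self.mass[n].get(t[n], 0) + t[-1]
        self.heaps = [[(m, a) for a, m in d.items()] for d in self.mass]
        for h in self.heaps:
            heapq.heapify(h)

    def peek_min(self, n):
        """Return (mass, value) with minimum mass for attribute n, or None."""
        h = self.heaps[n]
        while h and self.mass[n].get(h[0][1]) != h[0][0]:
            heapq.heappop(h)  # drop stale or deleted entries
        return h[0] if h else None

    def update(self, n, a, new_mass):
        self.mass[n][a] = new_mass
        heapq.heappush(self.heaps[n], (new_mass, a))  # old entry becomes stale

    def delete(self, n, a):
        del self.mass[n][a]


def choose_value(heaps, total_mass, rho=rho_ari):
    """Compare only the N per-attribute minima, as Lemma 1 permits."""
    cards = [len(d) for d in heaps.mass]
    best, best_density = None, float("-inf")
    for n in range(len(cards)):
        top = heaps.peek_min(n)
        if top is None:
            continue
        m, a = top
        new_cards = cards.copy()
        new_cards[n] -= 1
        density = rho(total_mass - m, new_cards)
        if density > best_density:
            best, best_density = (n, a), density
    return best


def remove_value(heaps, tuples, n, a):
    """Remove value a of attribute n and update the keys of affected values."""
    kept = []
    for t in tuples:
        if t[n] != a:
            kept.append(t)
            continue
        for j in range(len(heaps.mass)):
            if j != n:
                heaps.update(j, t[j], heaps.mass[j][t[j]] - t[-1])
    heaps.delete(n, a)
    return kept


R = [("u1", "p1", 5), ("u1", "p2", 5), ("u2", "p1", 5), ("u2", "p2", 5),
     ("u3", "p3", 1), ("u4", "p4", 1)]
heaps = MinMassHeaps(R, 2)
first = choose_value(heaps, total_mass=22)
remaining = remove_value(heaps, R, *first)
print(first, heaps.peek_min(1))  # (0, 'u3') (0, 'p3')
```

After `u3` is removed, the key of `p3` drops to 0, so it becomes the minimum of the page heap and the natural next candidate.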

Densest Block Selection Using Attribute Value Ordering. As explained in Sect. 3.1, M-Zoom returns the densest block among snapshots of $\varvec{\mathcal {B}}$ (line 10 of Algorithm 2). Explicitly maintaining the list of snapshots, whose length is at most $S_{\varvec{\mathcal {R}}}$, requires $O(N|\varvec{\mathcal {R}}|S_{\varvec{\mathcal {R}}})$ computation and space for copying them. Even maintaining only the current best (i.e., the one with the highest density so far) cannot avoid high computational cost if the current best keeps changing. Instead, our implementation maintains the order by which attribute values are removed as well as the iteration where the density was maximized, which requires only $O(S_{\varvec{\mathcal {R}}})$ space. From these and the original relation $\varvec{\mathcal {R}}$, our implementation restores the snapshot with maximum density in $O(N|\varvec{\mathcal {R}}|+S_{\varvec{\mathcal {R}}})$ time and returns it.
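The restoration step can be sketched as follows, assuming the removal order is stored as a list of (attribute index, value) pairs and the best iteration as a 0-based index; both representations, and the function name `restore_snapshot`, are assumptions of this sketch. Because the greedy search removes every attribute value eventually, the values present at the start of iteration r are exactly those removed at position r or later.

```python
def restore_snapshot(relation, n_dims, removal_order, best_iteration):
    """Rebuild the block as it was at the start of iteration best_iteration.

    removal_order: list of (n, a) pairs in the order values were removed.
    best_iteration: 0-based index r of the densest snapshot.
    Returns (value_sets, tuples) of the restored block.
    """
    removed = set(removal_order[:best_iteration])
    value_sets = [set() for _ in range(n_dims)]
    for n, a in removal_order[best_iteration:]:
        value_sets[n].add(a)
    # One pass over the relation: O(N |R|) membership checks.
    tuples = [t for t in relation
              if all((n, t[n]) not in removed for n in range(n_dims))]
    return value_sets, tuples


R = [("u1", "p1", 5), ("u1", "p2", 5), ("u2", "p1", 5), ("u2", "p2", 5),
     ("u3", "p3", 1), ("u4", "p4", 1)]
order = [(0, "u3"), (1, "p3"), (0, "u4"), (1, "p4"),
         (0, "u1"), (1, "p1"), (0, "u2"), (1, "p2")]
sets, block_tuples = restore_snapshot(R, 2, order, best_iteration=4)
print(sets, len(block_tuples))
```

Only the order and one index are stored during the search, which matches the $O(S_{\varvec{\mathcal {R}}})$ space stated above; the tuples are materialized once at the end.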

3.3 Complexity Analysis

The time and space complexity of M-Zoom depend on the density measure used. In this section, we assume that one of the density measures in Sect. 2.2, which satisfy Axiom 1, is used.

Theorem 2

The time complexity of Algorithm 1 is $O(kN|\varvec{\mathcal {R}}|\log L)$ if $|\varvec{\mathcal {R}}_{n}|=L$, $\forall n\in [N]$, and $N = O(\log L)$.

Proof

See Appendix B.

As stated in Theorem 2, M-Zoom scales linearly or sub-linearly with all aspects of relation $\varvec{\mathcal {R}}$ as well as k, the number of blocks we aim to find. This result is also experimentally supported in Sect. 4.4. In our experiments, the actual running time scaled sub-linearly with k as well as L since the number of tuples in $\varvec{\mathcal {R}}$ decreases as M-Zoom finds blocks (line 5 in Algorithm 1).

Theorem 3

The space complexity of Algorithm 1 is $O(kN|\varvec{\mathcal {R}}|)$.

Proof

See the supplementary document [1]. $\quad \square $

M-Zoom requires up to $kN|\varvec{\mathcal {R}}|$ space for storing k found blocks, as stated in Theorem 3. However, since the blocks are usually far smaller than $\varvec{\mathcal {R}}$, as seen in Tables 4 and 5 in Sect. 4, actual space usage is much less than $kN|\varvec{\mathcal {R}}|$.

3.4 Accuracy Guarantee

In this section, we show lower bounds on the densities of the blocks found by M-Zoom on the assumption that $\rho _{ari}$ (Definition 1) is used as the density measure. Specifically, we show that Algorithm 2 without size bounds is guaranteed to find a block with density at least 1/N of maximum density in the given relation (Theorem 4). This means that each n-th block returned by Algorithm 1 has density at least 1/N of maximum density in $\varvec{\mathcal {R}}-\bigcup _{i=1}^{n-1}(i$-th block). Let $\varvec{\mathcal {B}}^{(r)}$ be the relation $\varvec{\mathcal {B}}$ at the beginning of the r-th iteration of Algorithm 2, and $a_{i}^{(r)} \in \varvec{\mathcal {B}}^{(r)}_{i}$ be the attribute value removed in the same iteration.

Lemma 2

If a block $\varvec{\mathcal {B}}'$ satisfying $\forall a_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}'_{n}$, $M_{\varvec{\mathcal {B}}'(a_{i})}\ge c$ exists, there exists $\varvec{\mathcal {B}}^{(r)}$ satisfying $\forall a_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}^{(r)}_{n}$, $M_{\varvec{\mathcal {B}}^{(r)}(a_{i})}\ge c$.

Proof

See Appendix C. $\quad \square $

Theorem 4

(1/N -Approximation Guarantee for Problem 1 ). Given a relation $\varvec{\mathcal {R}}$, let $\varvec{\mathcal {B}}^{*}$ be the block $\varvec{\mathcal {B}}\subset \varvec{\mathcal {R}}$ with maximum $\rho _{ari}(\varvec{\mathcal {B}},\varvec{\mathcal {R}})$. Let $\varvec{\mathcal {B}}'$ be the block obtained by Algorithm 2 without size bounds (i.e., $S_{min}=0$ and $S_{max}=\infty $). Then, $\rho _{ari}(\varvec{\mathcal {B}}',\varvec{\mathcal {R}}) \ge \rho _{ari}(\varvec{\mathcal {B}}^{*},\varvec{\mathcal {R}})/N$.

Proof

$\forall a_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}^{*}_{n}$, $M_{\varvec{\mathcal {B}}^{*}(a_{i})}$ $\ge M_{\varvec{\mathcal {B}}^{*}}/S_{\varvec{\mathcal {B}}^{*}}$. Otherwise, a contradiction would result since for $a_{i}$ with $M_{\varvec{\mathcal {B}}^{*}(a_{i})}< M_{\varvec{\mathcal {B}}^{*}}/S_{\varvec{\mathcal {B}}^{*}}$,

$$ \rho _{ari}(\varvec{\mathcal {B}}^{*}-\varvec{\mathcal {B}}^{*}(a_{i}),\varvec{\mathcal {R}}) = \frac{M_{\varvec{\mathcal {B}}^{*}}-M_{\varvec{\mathcal {B}}^{*}(a_{i})}}{(S_{\varvec{\mathcal {B}}^{*}}-1)/N}> \frac{M_{\varvec{\mathcal {B}}^{*}}-M_{\varvec{\mathcal {B}}^{*}}/S_{\varvec{\mathcal {B}}^{*}}}{(S_{\varvec{\mathcal {B}}^{*}}-1)/N}=\rho _{ari}(\varvec{\mathcal {B}}^{*},\varvec{\mathcal {R}}). $$

Consider $\varvec{\mathcal {B}}^{(r)}$ where $\forall a_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}^{(r)}_{n}$, $M_{\varvec{\mathcal {B}}^{(r)}(a_{i})}\ge M_{\varvec{\mathcal {B}}^{*}}/S_{\varvec{\mathcal {B}}^{*}}$. Such $\varvec{\mathcal {B}}^{(r)}$ exists by Lemma 2. For each attribute $A_{n}$, the masses $M_{\varvec{\mathcal {B}}^{(r)}(a_{n})}$ over $a_{n}\in \varvec{\mathcal {B}}^{(r)}_{n}$ sum to $M_{\varvec{\mathcal {B}}^{(r)}}$, and at least one attribute satisfies $|\varvec{\mathcal {B}}^{(r)}_{n}|\ge S_{\varvec{\mathcal {B}}^{(r)}}/N$. Therefore, $M_{\varvec{\mathcal {B}}^{(r)}}\ge (S_{\varvec{\mathcal {B}}^{(r)}}/N)(M_{\varvec{\mathcal {B}}^{*}}/S_{\varvec{\mathcal {B}}^{*}})=(S_{\varvec{\mathcal {B}}^{(r)}}/N)(\rho _{ari}(\varvec{\mathcal {B}}^{*},\varvec{\mathcal {R}})/N)$. Hence, $\rho _{ari}(\varvec{\mathcal {B}}',\varvec{\mathcal {R}}) \ge \rho _{ari}(\varvec{\mathcal {B}}^{(r)},\varvec{\mathcal {R}}) = M_{\varvec{\mathcal {B}}^{(r)}}/(S_{\varvec{\mathcal {B}}^{(r)}}/N) \ge \rho _{ari}(\varvec{\mathcal {B}}^{*},\varvec{\mathcal {R}})/N.$ $\quad \square $

Theorem 4 can be extended to cases where a lower size bound exists. In these cases, the approximation factor is 1/(N+1), as stated in Theorem 5.

Theorem 5

( $1/(N+1)$ -Approximation Guarantee for Problem 2 ). Given a relation $\varvec{\mathcal {R}}$, let $\varvec{\mathcal {B}}^{*}$ be the block $\varvec{\mathcal {B}}\subset \varvec{\mathcal {R}}$ with maximum $\rho _{ari}(\varvec{\mathcal {B}},\varvec{\mathcal {R}})$ among blocks with size at least $S_{min}$. Let $\varvec{\mathcal {B}}'$ be the block obtained by Algorithm 2 with lower size bound (i.e., $1 \le S_{min}\le S_{\varvec{\mathcal {R}}}$ and $S_{max}=\infty $). Then, $\rho _{ari}(\varvec{\mathcal {B}}',\varvec{\mathcal {R}}) \ge \rho _{ari}(\varvec{\mathcal {B}}^{*},\varvec{\mathcal {R}})/(N+1)$.

Proof

See the supplementary document [1]. $\quad \square $

4 Experiments

We designed and performed experiments to answer the following questions:

Q1. How fast and accurately does M-Zoom detect dense blocks in real data?
Q2. Does M-Zoom find many different dense blocks in real data?
Q3. Does M-Zoom scale linearly with all aspects of data?
Q4. Which anomalies or fraud does M-Zoom spot in real data?

Table 3. Summary of real-world datasets.


4.1 Experimental Settings

All experiments were conducted on a machine with 2.67 GHz Intel Xeon E7-8837 CPUs and 1 TB RAM. We compared M-Zoom with CrossSpot [12], CP Decomposition (CPD) [17] (see Appendix A for details), and MultiAspectForensics (MAF) [19]. M-Zoom and CrossSpot (see Note 1) were implemented in Java, and Tensor Toolbox [4] was used for CPD and MAF. Although CrossSpot was originally designed to maximize $\rho _{susp}$, it can be extended to other density measures. These variants were used depending on the density measure compared in each experiment. In addition, we used CPD as the seed selection method of CrossSpot, which outperformed the HOSVD-based selection used in [12] in terms of both speed and accuracy. We used diverse real-world datasets, grouped as follows:

User behavior logs: StackO.(user,post,timestamp,1) represents who marked which post as a favorite when on Stack Overflow. Youtube(user,user,date,1) represents who became a friend of whom when on Youtube. KoWiki(user,page,timestamp,#revisions) and EnWiki(user,page,timestamp,#revisions) represent who revised which page when how many times on Korean Wikipedia and English Wikipedia, respectively.
User reviews: Yelp(user,business,date,score,1), Netflix(user,movie,date,score,1), and YahooM.(user,item,timestamp,score,1) represent who gave which score when to which business, movie, and item on Yelp, Netflix, and Yahoo Music, respectively.
TCP dumps: From TCP dump data for a typical U.S. Air Force LAN, we created a relation AirForce(protocol,service,src_bytes,dst_bytes,flag,host_count,src_count,#connections). See the supplementary document [1] for the description of each attribute.

Timestamps are in hours in all the datasets. Table 3 summarizes all the datasets.

4.2 Q1. Running Time and Accuracy of M-Zoom

We compare the speed of different methods and the densities of the blocks found by the methods in real-world datasets. Specifically, we measured the time taken to find three blocks and the maximum density among the three blocks. Figure 2 shows the result when $\rho _{ari}$ was used as the density measure. M-Zoom clearly provided the best trade-off between speed and accuracy in all datasets. For example, on the YahooM. dataset, M-Zoom was 114 times faster than CrossSpot while detecting blocks with similar densities. Compared with CPD, M-Zoom detected blocks that were twice as dense, and did so 2.8 times faster. Although the results are not included in Fig. 2, MAF found blocks several orders of magnitude sparser than those of the other methods, with speed similar to that of CPD. M-Zoom also gave the best trade-off between speed and accuracy when $\rho _{geo}$ or $\rho _{susp}$ was used instead of $\rho _{ari}$ (see the supplementary document [1]).

4.3 Q2. Diversity of Blocks Found by M-Zoom

We compare the diversity of dense blocks found by each method. Ability to detect many different dense blocks is useful since distinct blocks may indicate different anomalies or fraud. We define the diversity as the average dissimilarity between the pairs of blocks, and the dissimilarity of two blocks is defined as $dissimilarity(\varvec{\mathcal {B}}, \varvec{\mathcal {B}}') = 1-\frac{ |(\bigcup {}_{n=1}^{N}{\varvec{\mathcal {B}}_{n}}) \cap (\bigcup {}_{n=1}^{N}{\varvec{\mathcal {B}}'_{n}})|}{|(\bigcup {}_{n=1}^{N}{\varvec{\mathcal {B}}_{n}}) \cup (\bigcup {}_{n=1}^{N}{\varvec{\mathcal {B}}'_{n}})|}.$ Diversities were measured among three blocks found by each method using $\rho _{ari}$ as the density metric.
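The dissimilarity and diversity defined above can be sketched as follows. A block is represented by its list of value sets, and each value is tagged with its attribute index before taking unions so that equal values from different attributes are kept apart; that tagging, and the function names, are assumptions of this sketch rather than something the formula specifies.

```python
from itertools import combinations


def dissimilarity(b1, b2):
    """1 minus the Jaccard similarity of the attribute values of two blocks."""
    v1 = {(n, a) for n, s in enumerate(b1) for a in s}
    v2 = {(n, a) for n, s in enumerate(b2) for a in s}
    union = v1 | v2
    if not union:
        return 0.0  # two empty blocks are treated as identical
    return 1 - len(v1 & v2) / len(union)


def diversity(blocks):
    """Average dissimilarity over all pairs of blocks."""
    pairs = list(combinations(blocks, 2))
    if not pairs:
        raise ValueError("diversity needs at least two blocks")
    return sum(dissimilarity(x, y) for x, y in pairs) / len(pairs)


b1 = [{"u1", "u2"}, {"p1"}]
b2 = [{"u2", "u3"}, {"p1"}]
b3 = [{"u9"}, {"p9"}]
print(dissimilarity(b1, b2))  # 0.5
print(diversity([b1, b2, b3]))
```

Here `b1` and `b2` share two of four distinct values, giving dissimilarity 0.5, while `b3` shares nothing with either, so the diversity of the three blocks is (0.5 + 1 + 1)/3.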

As seen in Fig. 3, in all datasets, M-Zoom and CPD successfully detected distinct dense blocks. CrossSpot, however, found the same block repeatedly or blocks with slight difference, even when it started from different seed blocks. Although using CPD for seed-block selection in CrossSpot improved the diversity, the effect was limited in most datasets. Similar results were obtained when $\rho _{geo}$ or $\rho _{susp}$ was used instead of $\rho _{ari}$ (see the supplementary document [1]).

4.4 Q3. Scalability of M-Zoom

We empirically demonstrate the scalability of M-Zoom, mathematically analyzed in Theorem 2. Specifically, we measured the scalability of M-Zoom with regard to the number of tuples, the number of attributes, the cardinalities of attributes, and the number of blocks we aim to find. We started by finding one block in a randomly generated relation of 10 million tuples with three attributes, each with cardinality 100K. Then, we measured the running time by changing one factor at a time while fixing the others. As seen in Fig. 4, M-Zoom scaled linearly with the number of tuples and the number of attributes. Moreover, M-Zoom scaled sub-linearly with the number of blocks we aim to find as well as the cardinalities of attributes, for the reason explained in Sect. 3.3. These results held regardless of the density measure used.

4.5 Q4. Anomaly/Fraud Detection by M-Zoom in Real Data

We demonstrate the effectiveness of M-Zoom for anomaly and fraud detection by analyzing dense blocks detected by M-Zoom in real-world datasets.

M-Zoom spots edit wars and bot activities in Wikipedia. Table 4 lists the first three dense blocks found by M-Zoom in EnWiki and KoWiki Datasets. As seen in the third dense block visualized in Fig. 1c, the dense blocks detected in KoWiki Dataset indicate edit wars. That is, users with conflicting opinions revised the same set of pages hundreds of times within several hours. On the other hand, the dense blocks detected in EnWiki Dataset indicate the activities of bots, which changed the same pages hundreds of thousands of times. Figure 1d lists the bots and pages corresponding to the second found block.

Table 4. M-Zoom detects anomalous behaviors in Wikipedia. The tables list the first three blocks detected by M-Zoom in KoWiki and EnWiki Datasets, which correspond to edit wars and bot activities, respectively.


Table 5. M-Zoom identifies network attacks with near-perfect accuracy. The first three blocks found by M-Zoom in AirForce Dataset consist of attacks.


M-Zoom spots network intrusions. Table 5 lists the first three blocks found by M-Zoom in AirForce Dataset. Based on the provided ground truth labels, all of the about 3 million connections composing the blocks were attacks except for a single normal connection. This indicates that malicious connections form dense blocks due to the similarity in their behaviors. Based on this observation, we could accurately separate normal connections and attacks based on the densities of the blocks they belong to (i.e., the denser the block a connection belongs to, the more suspicious it is). In particular, we obtained the highest AUC (area under the curve), 0.98, with M-Zoom, as shown in Fig. 1b, because M-Zoom detects many different dense blocks accurately, as shown in the previous experiments. For each method, we used the density measure that led to the highest AUC.
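One way to turn detected blocks into a ranking and an AUC value is sketched below. The exact scoring procedure used in the experiments is not spelled out in this section, so the rule here (score each connection by the density of the densest detected block containing it, and 0 if none does) is an assumption, as are the function names and the toy data. The AUC is computed as the probability that a random attack scores above a random normal connection, with ties counted as one half.

```python
def suspiciousness_scores(tuples, blocks):
    """Score each tuple by the max density of a detected block containing it.

    blocks: list of (density, value_sets) pairs; tuples outside every block
    get score 0.0 (an assumption of this sketch).
    """
    scores = []
    for t in tuples:
        score = 0.0
        for density, value_sets in blocks:
            if all(t[n] in value_sets[n] for n in range(len(value_sets))):
                score = max(score, density)
        scores.append(score)
    return scores


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (quadratic time)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented toy connections (protocol, service, #connections) and one block.
connections = [("tcp", "http", 5), ("udp", "dns", 1), ("tcp", "ftp", 1)]
blocks = [(10.0, [{"tcp"}, {"http"}])]
scores = suspiciousness_scores(connections, blocks)
print(scores)                  # [10.0, 0.0, 0.0]
print(auc(scores, [1, 0, 0]))  # 1.0
```

The pairwise AUC is quadratic in the number of connections; for millions of connections a rank-based computation would be needed instead.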

5 Related Work

Dense Subgraph/Submatrix/Subtensor Detection. The densest subgraph problem, the problem of finding the subgraph which maximizes $\rho _{ari}$ or $\rho _{geo}$ (see Definitions 1 and 2), has been extensively studied in theory (see [18] for surveys). The two major directions are max-flow based exact algorithms [9, 16] and greedy algorithms [7, 16] giving a 1/2-approximation to the densest subgraph. Variants allow for size restrictions [3], providing a 1/3-approximation to the densest subgraph for the lower bound case. Another related line of research deals with dense blocks in binary matrices or tensors where the definition of density is designed for the purpose of frequent itemset mining [22] or formal concept mining [6, 11].

Anomaly/Fraud Detection based on Dense Subgraphs. Spectral approaches make use of eigendecomposition or SVD of the adjacency matrix for dense-block detection. Such approaches have been used to spot anomalous patterns in a patent graph [21], lockstep followers in a social network [14], and stealthy or small-scale attacks in social networks [23]. Other approaches include NetProbe [20], which used belief propagation to detect fraud-accomplice bipartite cores in an auction network, and CopyCatch [5], which used one-class clustering and sub-space clustering to identify “Like” boosting in Facebook. In addition, OddBall [2] spotted near-cliques in links among posts in blogs based on egonet features. Recently, Fraudar [10], which generalizes densest subgraph-detection methods so that the suspiciousness of nodes and edges can be incorporated, spotted follower-buying services in Twitter.

Anomaly/Fraud Detection based on Dense Subtensors. Spectral methods for dense subgraphs can be extended to tensors where tensor decomposition, such as CP Decomposition and HOSVD [17], is used to spot dense subtensors. MAF [19], which is based on CP Decomposition, detected dense blocks corresponding to port-scanning activities based on network traffic logs. Another approach is CrossSpot [12], which finds dense blocks by starting from seed blocks and growing them in a greedy way until $\rho _{susp}$ (see Definition 3) converges. CrossSpot spotted retweet boosting in Weibo, outperforming HOSVD.

Our M-Zoom non-trivially generalizes theoretical results regarding the densest subgraph problem, especially [3], for supporting tensors, various density measures, and multi-block detection. As seen in Table 1, M-Zoom provides more flexibility than other methods for dense-block detection.

6 Conclusion

In this work, we propose M-Zoom, a flexible framework for finding dense blocks in tensors, which has the following advantages over state-of-the-art methods:

Scalable: M-Zoom is up to 114 $\times $ faster than competitors with similar accuracy due to its linear scalability with all input factors (Figs. 2 and 4).
Provably accurate: M-Zoom provides lower bounds on the densities of the blocks it finds (Theorem 4) as well as high accuracy in real data (Fig. 2).
Flexible: M-Zoom supports high-order tensors, various density measures, multi-block detection, and size bounds (Table 1).
Effective: M-Zoom successfully detected fraud based on a TCP dump with near-perfect accuracy (AUC = 0.98), and anomalies in Wikipedia (Fig. 1).

Reproducibility: Our open-sourced code and the data we used are at http://www.cs.cmu.edu/~kijungs/codes/mzoom.

Notes

1. We referred to the open-source implementation at http://github.com/mjiang89/CrossSpot.

References

1. Supplementary document (examples, proofs, and additional experiments). http://www.cs.cmu.edu/~kijungs/codes/mzoom/supple.pdf
2. Akoglu, L., McGlohon, M., Faloutsos, C.: Oddball: spotting anomalies in weighted graphs. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS (LNAI), vol. 6119, pp. 410–421. Springer, Heidelberg (2010). doi:10.1007/978-3-642-13672-6_40
3. Andersen, R., Chellapilla, K.: Finding dense subgraphs with size bounds. In: Kumar, R., Sivakumar, D. (eds.) WAW 2010. LNCS, vol. 6516, pp. 25–37. Springer, Heidelberg (2009). doi:10.1007/978-3-540-95995-3_3
4. Bader, B.W., Kolda, T.G., et al.: Matlab tensor toolbox version 2.6. http://www.sandia.gov/~tgkolda/TensorToolbox/
5. Beutel, A., Xu, W., Guruswami, V., Palow, C., Faloutsos, C.: Copycatch: stopping group attacks by spotting lockstep behavior in social networks. In: WWW (2013)
6. Cerf, L., Besson, J., Robardet, C., Boulicaut, J.F.: Data peeler: constraint-based closed pattern mining in n-ary relations. In: SDM (2008)
7. Charikar, M.: Greedy approximation algorithms for finding dense components in a graph. In: Jansen, K., Leonardi, S., Vazirani, V. (eds.) APPROX 2002. LNCS, vol. 2462, pp. 84–95. Springer, Heidelberg (2000). doi:10.1007/3-540-44436-X_10
8. Gibson, D., Kumar, R., Tomkins, A.: Discovering large dense subgraphs in massive graphs. In: VLDB (2005)
9. Goldberg, A.V.: Finding a maximum density subgraph. Technical report (1984)
10. Hooi, B., Song, H.A., Beutel, A., Shah, N., Shin, K., Faloutsos, C.: Fraudar: bounding graph fraud in the face of camouflage. In: KDD (2016)
11. Ignatov, D.I., Kuznetsov, S.O., Poelmans, J., Zhukov, L.E.: Can triconcepts become triclusters? Int. J. Gen. Syst. 42(6), 572–593 (2013)
12. Jiang, M., Beutel, A., Cui, P., Hooi, B., Yang, S., Faloutsos, C.: A general suspiciousness metric for dense blocks in multimodal data. In: ICDM (2015)
13. Jiang, M., Cui, P., Beutel, A., Faloutsos, C., Yang, S.: Catchsync: catching synchronized behavior in large directed graphs. In: KDD (2014)
14. Jiang, M., Cui, P., Beutel, A., Faloutsos, C., Yang, S.: Inferring strange behavior from connectivity pattern in social networks. In: Tseng, V.S., Ho, T.B., Zhou, Z.-H., Chen, A.L.P., Kao, H.-Y. (eds.) PAKDD 2014. LNCS (LNAI), vol. 8443, pp. 126–138. Springer, Heidelberg (2014). doi:10.1007/978-3-319-06608-0_11
15. Kannan, R., Vinay, V.: Analyzing the structure of large graphs. Technical report (1999)
16. Khuller, S., Saha, B.: On finding dense subgraphs. In: Loeckx, J. (ed.) ICALP 1974. LNCS, vol. 14, pp. 597–608. Springer, Heidelberg (2009). doi:10.1007/978-3-642-02927-1_50
17. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
18. Lee, V.E., Ruan, N., Jin, R., Aggarwal, C.: A survey of algorithms for dense subgraph discovery. In: Aggarwal, C.C., Wang, H. (eds.) Managing and Mining Graph Data, vol. 40, pp. 303–336. Springer, Heidelberg (2010)
19. Maruhashi, K., Guo, F., Faloutsos, C.: Multiaspectforensics: pattern mining on large-scale heterogeneous networks with tensor analysis. In: ASONAM (2011)
20. Pandit, S., Chau, D.H., Wang, S., Faloutsos, C.: Netprobe: a fast and scalable system for fraud detection in online auction networks. In: WWW (2007)
21. Prakash, B.A., Sridharan, A., Seshadri, M., Machiraju, S., Faloutsos, C.: EigenSpokes: surprising patterns and scalable community chipping in large graphs. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS (LNAI), vol. 6119, pp. 435–448. Springer, Heidelberg (2010). doi:10.1007/978-3-642-13672-6_42
22. Seppänen, J.K., Mannila, H.: Dense itemsets. In: KDD (2004)
23. Shah, N., Beutel, A., Gallagher, B., Faloutsos, C.: Spotting suspicious link behavior with fbox: an adversarial perspective. In: ICDM (2014)

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. CNS-1314632 and IIS-1408924. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

Author information

Authors and Affiliations

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Kijung Shin, Bryan Hooi & Christos Faloutsos

Corresponding author

Correspondence to Kijung Shin.

Editor information

Editors and Affiliations

Università degli Studi di Firenze, Firenze, Italy
Paolo Frasconi
Computer Science, University of Potsdam, Potsdam, Germany
Niels Landwehr
High Performance Computing and Networks, Rende, Italy
Giuseppe Manco
MPI for Informatics, Saarland University, Saarbrücken, Saarland, Germany
Jilles Vreeken

Appendices

A CP Decomposition (CPD)

In a graph, dense subgraphs lead to high singular values of the adjacency matrix [23]. The singular vectors corresponding to the high singular values roughly indicate which nodes form dense blocks. This idea can be extended to tensors, where dense blocks are captured by components in CP Decomposition [17]. Let $\mathbf {A}^{(1)}\in \mathbb {R}^{|\varvec{\mathcal {R}}_{1}|\times k}$, $\mathbf {A}^{(2)}\in \mathbb {R}^{|\varvec{\mathcal {R}}_{2}|\times k}$, ..., $\mathbf {A}^{(N)}\in \mathbb {R}^{|\varvec{\mathcal {R}}_{N}|\times k}$ be the factor matrices obtained by the rank-k CP Decomposition of $\varvec{\mathcal {R}}$. For each $i\in [k]$, we form a block with every attribute value $a_{n}$ whose corresponding element in the i-th column of $\mathbf {A}^{(n)}$ is at least $1/\sqrt{|\varvec{\mathcal {R}}_{n}|}$.
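The block-forming rule above can be sketched in a few lines. The sketch assumes the factor matrices are already available (computing the CP Decomposition itself is out of scope here) and represents each $\mathbf{A}^{(n)}$ as a list of rows; the function name `blocks_from_factors` and the toy matrices are hypothetical. Note that $1/\sqrt{|\varvec{\mathcal{R}}_{n}|}$ is the entry of a uniform unit-length vector, so the rule is most natural when the columns are normalized, which is an assumption not stated in the text.

```python
import math
from typing import List, Sequence, Set


def blocks_from_factors(
    factors: Sequence[Sequence[Sequence[float]]],
) -> List[List[Set[int]]]:
    """Form one block per CP component by thresholding factor-matrix columns.

    factors[n][a][i] is the entry of the mode-n factor matrix for attribute
    value a and component i. The result satisfies: blocks[i][n] is the set of
    mode-n attribute values (row indices) selected for component i.
    """
    k = len(factors[0][0])  # rank of the decomposition
    blocks: List[List[Set[int]]] = []
    for i in range(k):
        block: List[Set[int]] = []
        for matrix in factors:
            # Threshold depends only on the number of attribute values in the mode.
            threshold = 1.0 / math.sqrt(len(matrix))
            block.append({a for a, row in enumerate(matrix) if row[i] >= threshold})
        blocks.append(block)
    return blocks


# Hypothetical rank-1 factors for a 2-way tensor with 4 values per mode,
# so the threshold is 1 / sqrt(4) = 0.5 in both modes.
toy_factors = [
    [[0.7], [0.7], [0.1], [0.0]],
    [[0.9], [0.2], [0.3], [0.1]],
]
blocks = blocks_from_factors(toy_factors)
```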

B Proof of Theorem 2

Proof

In Algorithm 3, lines 1–3 take O(N) for all the density measures considered (i.e., $\rho _{ari}$, $\rho _{geo}$, and $\rho _{susp}$) if we maintain and update aggregated values (e.g., $M_{B}$, $S_{B}$, and $V_{B}$) instead of computing $\rho (\varvec{\mathcal {B}}-\varvec{\mathcal {B}}(a'_{i}), \varvec{\mathcal {R}})$ from scratch every time. In addition, line 4 takes $O(\log |\varvec{\mathcal {R}}_{n}|)$ and lines 5–7 take $O(N|\varvec{\mathcal {B}}(a^{*}_{i})|)$ if we use Fibonacci heaps. Algorithm 2, whose computational bottleneck is line 7, has time complexity $O(N|\varvec{\mathcal {R}}|+N\sum _{n=1}^{N}|\varvec{\mathcal {R}}_{n}|+\sum _{n=1}^{N}|\varvec{\mathcal {R}}_{n}|\log |\varvec{\mathcal {R}}_{n}|)$ since lines 1–4 of Algorithm 3 are executed $S_{\varvec{\mathcal {R}}}=\sum _{n=1}^{N}|\varvec{\mathcal {R}}_{n}|$ times, and line 7 is executed $N|\varvec{\mathcal {R}}|$ times. Algorithm 1, whose computational bottleneck is line 4, has time complexity $O(kN|\varvec{\mathcal {R}}|+kN\sum _{n=1}^{N}|\varvec{\mathcal {R}}_{n}|+k\sum _{n=1}^{N}|\varvec{\mathcal {R}}_{n}|\log |\varvec{\mathcal {R}}_{n}|)$ since Algorithm 2 is executed k times.

Assume $|\varvec{\mathcal {R}}_{n}|=L$, $\forall n\in [N]$, and $N = O(\log L)$. The time complexity of Algorithm 1 becomes $O(kN(|\varvec{\mathcal {R}}|+NL+L\log L))$. Since $N = O(\log L)$, by assumption, and $L\le |\varvec{\mathcal {R}}|$, there exists a constant c such that $|\varvec{\mathcal {R}}|+NL+L\log L \le c|\varvec{\mathcal {R}}|\log L=O(|\varvec{\mathcal {R}}|\log L)$. Thus, the time complexity of Algorithm 1 is $O(kN|\varvec{\mathcal {R}}|\log L)$. $\quad \square $
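One way to make the constant $c$ in the proof above explicit is sketched below. It assumes base-2 logarithms, $L\ge 2$ (so that $\log L\ge 1$), and $N\le c_{0}\log L$ for some constant $c_{0}$; the symbol $c_{0}$ is introduced here only for illustration. Using $L\le |\varvec{\mathcal {R}}|$ in the last step:

$$|\varvec{\mathcal {R}}|+NL+L\log L \;\le\; |\varvec{\mathcal {R}}|\log L + c_{0}L\log L + L\log L \;\le\; (c_{0}+2)\,|\varvec{\mathcal {R}}|\log L,$$

so under these assumptions $c=c_{0}+2$ suffices.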

C Proof of Lemma 2

Lemma 3

$a_{i}^{(r)}$ minimizes $Mass(\varvec{\mathcal {B}}^{(r)}(a_{j}))$ among $a_{j}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}^{(r)}_{n}$.

Proof

From Theorem 1, $\rho _{ari}(\varvec{\mathcal {B}}^{(r)}-\varvec{\mathcal {B}}^{(r)}(a^{(r)}_{i}),\varvec{\mathcal {R}}) \ge \rho _{ari}(\varvec{\mathcal {B}}^{(r)}-\varvec{\mathcal {B}}^{(r)}(a_{j}),\varvec{\mathcal {R}})$, $\forall a_{j}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}^{(r)}_{n}$. Thus, $Mass(\varvec{\mathcal {B}}^{(r)}-\varvec{\mathcal {B}}^{(r)}(a_{i}^{(r)})) =\rho _{ari}(\varvec{\mathcal {B}}^{(r)}-\varvec{\mathcal {B}}^{(r)}(a^{(r)}_{i}),\varvec{\mathcal {R}})(Size(\varvec{\mathcal {B}}^{(r)})-1)/N \ge \rho _{ari}(\varvec{\mathcal {B}}^{(r)}-\varvec{\mathcal {B}}^{(r)}(a_{j}), \varvec{\mathcal {R}})(Size(\varvec{\mathcal {B}}^{(r)})-1)/N=Mass(\varvec{\mathcal {B}}^{(r)}-\varvec{\mathcal {B}}^{(r)}(a_{j}))$. Then, $Mass(\varvec{\mathcal {B}}^{(r)}(a_{i}^{(r)}))=Mass(\varvec{\mathcal {B}}^{(r)})-Mass(\varvec{\mathcal {B}}^{(r)}-\varvec{\mathcal {B}}^{(r)}(a^{(r)}_{i})) \le Mass(\varvec{\mathcal {B}}^{(r)})-Mass(\varvec{\mathcal {B}}^{(r)}-\varvec{\mathcal {B}}^{(r)}(a_{j}))=Mass(\varvec{\mathcal {B}}^{(r)}(a_{j}))$, $\forall a_{j}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}^{(r)}_{n}$. $\quad \square $
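The equivalence used in this proof, namely that under $\rho _{ari}$ the attribute value whose removal leaves the densest block is the one with minimum mass, can be checked numerically on a small example. The sketch below is illustrative only: it stores a tensor as a dictionary from attribute-value tuples to non-negative values, treats the whole tensor as the current block, and takes $Size$ to be the number of distinct attribute values over all modes; the helper names and the toy data are hypothetical.

```python
from typing import Dict, Tuple

Entry = Tuple[str, ...]
Key = Tuple[int, str]  # (mode index, attribute value)


def attribute_masses(tensor: Dict[Entry, float]) -> Dict[Key, float]:
    """Mass(B(a)) for every attribute value a: the sum of the entries of B
    that carry a in the corresponding mode."""
    masses: Dict[Key, float] = {}
    for entry, value in tensor.items():
        for mode, a in enumerate(entry):
            masses[(mode, a)] = masses.get((mode, a), 0.0) + value
    return masses


def rho_ari_after_removal(tensor: Dict[Entry, float], key: Key) -> float:
    """Arithmetic average mass of the block left after removing one attribute
    value: (Mass(B) - Mass(B(a))) / ((Size(B) - 1) / N)."""
    n_modes = len(next(iter(tensor)))
    masses = attribute_masses(tensor)
    total_mass = sum(tensor.values())
    size = len(masses)  # number of attribute values over all modes
    return (total_mass - masses[key]) / ((size - 1) / n_modes)


# Hypothetical 3-way toy tensor (user, page, day) -> count.
toy = {
    ("u1", "p1", "d1"): 5.0,
    ("u1", "p2", "d2"): 3.0,
    ("u2", "p1", "d2"): 1.0,
}
masses = attribute_masses(toy)
min_mass_key = min(masses, key=masses.get)
best_removal_key = max(masses, key=lambda key: rho_ari_after_removal(toy, key))
```

Because the denominator $(Size-1)/N$ is the same for every candidate, maximizing the remaining density is the same as minimizing the removed mass, which is what the two selections above confirm on the toy data.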

Proof of Lemma 2

Let r be the first iteration in Algorithm 2 where $a^{(r)}_{i}\in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}'_{n}$. Since $\varvec{\mathcal {B}}^{(r)} \supset \varvec{\mathcal {B}}'$, $Mass(\varvec{\mathcal {B}}^{(r)}(a^{(r)}_{i}))\ge Mass(\varvec{\mathcal {B}}'(a^{(r)}_{i}))\ge c$. By Lemma 3, $\forall a_{j} \in \bigcup _{n=1}^{N}\varvec{\mathcal {B}}^{(r)}_{n}$, $Mass(\varvec{\mathcal {B}}^{(r)}(a_{j}))\ge Mass(\varvec{\mathcal {B}}^{(r)}(a^{(r)}_{i})) \ge c$. $\quad \square $

About this paper

Cite this paper

Shin, K., Hooi, B., Faloutsos, C. (2016). M-Zoom: Fast Dense-Block Detection in Tensors with Quality Guarantees. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science(), vol 9851. Springer, Cham. https://doi.org/10.1007/978-3-319-46128-1_17

DOI: https://doi.org/10.1007/978-3-319-46128-1_17
Published: 04 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46127-4
Online ISBN: 978-3-319-46128-1
eBook Packages: Computer Science, Computer Science (R0)
