Mining skypatterns in fuzzy tensors

Abstract

Many data mining tasks rely on pattern mining. To identify the patterns of interest in a dataset, an analyst may define several measures that score, in different ways, the relevance of a pattern. Until recently, most algorithms could only handle constraints efficiently, i.e., every measure had to be associated with a user-defined threshold, which can be tricky to determine. Skypatterns were introduced to allow analysts to simply define the measures of interest, and to get as a result a set of globally optimal and semantically relevant patterns. Skypatterns are Pareto-optimal patterns: no other pattern scores better on one of the chosen measures and scores at least as well on every remaining measure. This article tackles the search for skypatterns in a more general context than the 0/1 (aka Boolean) matrix: the fuzzy tensor. The proposed solution supports a large class of measures. After explaining why and how their common mathematical property enables a safe pruning of the search space, an algorithm is presented. It builds upon multidupehack, a generalist pattern mining framework, which is now able to efficiently list skypatterns in addition to enforcing constraints on them. Experiments on two real-world fuzzy tensors illustrate the versatility of the proposal. Other experiments show it is typically more than one order of magnitude faster than the state-of-the-art algorithms, which can only mine 0/1 matrices.
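The Pareto-optimality that defines skypatterns can be illustrated with a minimal Python sketch. It is a naive quadratic filter, not the pruning-based algorithm of the article. The pattern names and the two scores per pattern are hypothetical, and every measure is assumed to be maximized.

```python
from typing import Dict, Sequence, Set, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every measure and better on one."""
    pairs = list(zip(a, b))
    return all(u >= v for u, v in pairs) and any(u > v for u, v in pairs)


def skyline(scores: Dict[str, Tuple[float, ...]]) -> Set[str]:
    """Patterns that no other pattern dominates (naive pairwise comparison)."""
    return {
        p
        for p, s in scores.items()
        if not any(dominates(t, s) for q, t in scores.items() if q != p)
    }


# Hypothetical patterns scored on two hypothetical measures (higher is better).
scores = {"A": (5, 0.9), "B": (3, 0.95), "C": (3, 0.8), "D": (6, 0.5)}
result = skyline(scores)
print(sorted(result))  # C is dominated by A, the three others are kept
```

No threshold is involved: the analyst only chooses the measures, and the dominated pattern C is discarded whatever its absolute scores.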

Notes

  1. ET-n-set stands for Error-Tolerant n-set.

  2. https://gitlab.com/nnadisic/skypatterns-uncertain-tensors.

  3. http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php.

Acknowledgements

We would like to thank Willy Ugarte, Bruno Crémilleux, Chedy Raïssi and Benjamin Négrevergne for providing the source codes of their algorithms and for their valuable comments.

Author information

Corresponding author

Correspondence to Loïc Cerf.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The work has been partially funded by the FAPEMIG under Grant No. APQ-04224-16 (Multilateral Cooperation FAPEMIG-CNRS) and by the ERC Starting Grant No. 679515.

Responsible editor: Po-ling Loh, Evimaria Terzi, Antti Ukkonen, Karsten Borgwardt and Katharina Heinrich.

A Piecewise (Anti-)Monotonicity of the Slope Measure

To simplify the proof that the slope is piecewise (anti-)monotone, all the outputs of the x and y data-access functions, i.e., the abscissas and the ordinates of the points, are assumed non-negative. If that is not the case, \(\min _{t \in \prod _{i \in I} X_i} x(t)\) is subtracted from every abscissa and \(\min _{t \in \prod _{i \in I} X_i} y(t)\) is subtracted from every ordinate, which moves all the points to the non-negative quadrant of the Cartesian coordinate system. The slope of the fitting line being invariant under translation, \(x \ge 0\) and \(y \ge 0\) are assumed without loss of generality.
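The translation invariance invoked above can be checked numerically. The following Python sketch assumes that the fitting line is the ordinary least-squares line. The four points are hypothetical.

```python
def slope(points):
    """Ordinary least-squares slope of the line fitting the (x, y) points."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)


# Hypothetical points, some with negative coordinates.
points = [(-2.0, 1.0), (0.0, -3.0), (1.0, 0.5), (3.0, 4.0)]

# Subtract the minimal abscissa and the minimal ordinate, as in the text.
min_x = min(x for x, _ in points)
min_y = min(y for _, y in points)
shifted = [(x - min_x, y - min_y) for x, y in points]

original_slope = slope(points)
shifted_slope = slope(shifted)
print(original_slope, shifted_slope)  # the two slopes agree
```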

A rewriting \(m'_{\text {slope}}\) of the slope \(m_{\text {slope}}\) maps \(({L}, {U}) \in \left( \prod _{i = 1}^n 2^{D_i}\right) ^2\) to:

  1. Case 1: if \(\text {denom}({U}, {L}) > 0\) then

    (a) \(\displaystyle \frac{\text {num}({L}, {U})}{\text {denom}({U}, {L})}\) if \(\text {num}({L}, {U}) > 0\);

    (b) \(\displaystyle \frac{\text {num}({L}, {U})}{\text {denom}({L}, {U})}\) otherwise;

  2. Case 2: if \(\text {denom}({L}, {U}) < 0\) then

    (a) \(\displaystyle \frac{\text {num}({U}, {L})}{\text {denom}({L}, {U})}\) if \(\text {num}({U}, {L}) < 0\);

    (b) \(\displaystyle \frac{\text {num}({U}, {L})}{\text {denom}({U}, {L})}\) otherwise;

  3. Case 3: otherwise, \(+\infty \);

where \(\forall (X^1, X^2) = (X_1^1, \dots , X_n^1, X_1^2, \dots , X_n^2) \in \left( \prod _{i = 1}^n 2^{D_i}\right) ^2\):

  • num\((X^1, X^2) = \displaystyle \sum _{t \in \prod _{i \in I} X_i^2} x(t) \sum _{t \in \prod _{i \in I} X_i^2} y(t) - \left| \prod _{i \in I} X_i^1\right| \sum _{t \in \prod _{i \in I} X_i^1} x(t)y(t)\);

  • denom\((X^1, X^2) = \displaystyle \left( \sum _{t \in \prod _{i \in I} X_i^2} x(t)\right) ^2 - \left| \prod _{i \in I} X_i^1\right| \sum _{t \in \prod _{i \in I} X_i^1} x(t)^2\).
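The definitions of num, denom and \(m'_{\text {slope}}\) can be transcribed almost literally. The Python sketch below makes simplifying assumptions that are not in the article: there is a single dimension, so that a pattern is a set of indexes of points, and the data are hypothetical. It is an illustration of the formulas, not the implementation in multidupehack.

```python
import math


def num(p1, p2, x, y):
    """num(X^1, X^2): the sums run over p2, the subtracted term over p1."""
    sum_x = sum(x[t] for t in p2)
    sum_y = sum(y[t] for t in p2)
    return sum_x * sum_y - len(p1) * sum(x[t] * y[t] for t in p1)


def denom(p1, p2, x):
    """denom(X^1, X^2): the squared sum runs over p2, the rest over p1."""
    return sum(x[t] for t in p2) ** 2 - len(p1) * sum(x[t] ** 2 for t in p1)


def m_slope_prime(low, up, x, y):
    """Rewriting m'_slope evaluated at (L, U); low is assumed included in up."""
    if denom(up, low, x) > 0:  # case 1
        n = num(low, up, x, y)
        if n > 0:  # sub-case (a)
            return n / denom(up, low, x)
        return n / denom(low, up, x)  # sub-case (b)
    if denom(low, up, x) < 0:  # case 2
        n = num(up, low, x, y)
        if n < 0:  # sub-case (a)
            return n / denom(low, up, x)
        return n / denom(up, low, x)  # sub-case (b)
    return math.inf  # case 3


# Hypothetical non-negative abscissas and ordinates of four points.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 5.0]
full = {0, 1, 2, 3}

exact = m_slope_prime(full, full, x, y)  # slope of the four points
bound = m_slope_prime({1, 2, 3}, full, x, y)  # value at (L, U)
print(exact, bound)
```

With these data, both calls fall in case 2a. The value computed at \((L, U)\) is greater than the slope of the four points and than the slope of the three points indexed by \(L\), the only two patterns between \(L\) and \(U\).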

The equality \(m'_{\text {slope}}(X, X) = m_{\text {slope}}(X)\), for any pattern \(X \in \prod _{i = 1}^n 2^{D_i}\), derives from the equality \(\frac{\text {num}(X, X)}{\text {denom}(X, X)} = m_{\text {slope}}(X)\), for cases 1 and 2 in the definition of \(m'_{\text {slope}}\), and from the nullity of \(\text {denom}(X, X)\) in case 3.
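Assuming \(m_{\text {slope}}\) is the slope of the ordinary least-squares fitting line (its definition is given in the main text, not in this appendix), the equality can be spelled out. With all sums taken over \(t \in \prod _{i \in I} X_i\) and with \(N = \left| \prod _{i \in I} X_i\right| \), a shorthand introduced here, multiplying the numerator and the denominator by \(-1\) gives:

\[
\frac{\text {num}(X, X)}{\text {denom}(X, X)} = \frac{\sum x(t) \sum y(t) - N \sum x(t)y(t)}{\left( \sum x(t)\right) ^2 - N \sum x(t)^2} = \frac{N \sum x(t)y(t) - \sum x(t) \sum y(t)}{N \sum x(t)^2 - \left( \sum x(t)\right) ^2}
\]

The right-hand side is the usual closed form of the least-squares slope.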

The rewriting \(m'_{\text {slope}}\) actually proves that \(m_{\text {slope}}\) is piecewise (anti-)monotone. To show it, following Definition 8, let us take \(U \in \prod _{i = 1}^n 2^{D_i}\), \(X \in \prod _{i = 1}^n 2^{U_i}\) and \(L \in \prod _{i = 1}^n 2^{X_i}\). L being a sub-pattern of X, its subsets of the dimensions with indexes in I are subsets of those of X, i.e., \(\forall i \in I\), \(L_i \subseteq X_i\). That implies \(\prod _{i \in I} L_i \subseteq \prod _{i \in I} X_i\), which in turn implies both \(\left| \prod _{i \in I} L_i\right| \le \left| \prod _{i \in I} X_i\right| \) and \(\sum _{t \in \prod _{i \in I} L_i} x(t)^2 \le \sum _{t \in \prod _{i \in I} X_i} x(t)^2\). As a consequence, the (positive) quantity subtracted in the expression of denom is smaller if L, rather than X, is input as the first argument. U being a super-pattern of X, the first sum, in the expression of denom, involves more terms when U, rather than X, is input as the second argument. Because \(x \ge 0\), that sum is greater and so is its square. Combining the results on both parts in the expression of denom, \(\hbox {denom}(X, X) \le \hbox {denom}(L, U)\) holds. It entails \(\hbox {denom}(X, X) > 0 \Rightarrow \hbox {denom}(L, U) > 0\), i.e., if \((X, X)\) triggers case 1 of \(m'_{\text {slope}}\) then \((L, U)\) cannot trigger case 2.

The same steps as in the previous paragraph, but considering X or its super-pattern U as the first input of denom, and X or its sub-pattern L as the second input of denom, prove \(\hbox {denom}(U, L) \le \hbox {denom}(X, X)\). That inequality entails \(\hbox {denom}(X, X) < 0 \Rightarrow \hbox {denom}(U, L) < 0\), i.e., if \((X, X)\) triggers case 2 of \(m'_{\text {slope}}\) then \((L, U)\) cannot trigger case 1. Also, \(\hbox {denom}(X, X) = 0\) implies both \(\hbox {denom}(U, L) \le 0\) and \(\hbox {denom}(L, U) \ge 0\), i.e., if \((X, X)\) triggers case 3 then \((L, U)\) triggers neither case 1 nor case 2. Given all the impossibilities proven so far, if \((X, X)\) triggers case \(k \in \{1, 2, 3\}\) in the definition of \(m'_{\text {slope}}\) then \((L, U)\) triggers either case k or case 3.

If \((L, U)\) triggers case 3, \(m_{\text {slope}}(X) = m'_{\text {slope}}(X, X) \le m'_{\text {slope}}(L, U) = +\infty \). It remains to prove \(m_{\text {slope}}(X) \le m'_{\text {slope}}(L, U)\) when \((X, X)\) and \((L, U)\) both trigger case 1 or when they both trigger case 2. An analysis of the expression of num, which is analogous to the earlier analysis of denom and uses both \(x \ge 0\) and \(y \ge 0\), proves \(\hbox {num}(U, L) \le \hbox {num}(X, X) \le \hbox {num}(L, U)\) and, consequently, the impossibility for \((L, U)\) to trigger a sub-case (b) if \((X, X)\) triggers the related sub-case (a). If, on the contrary, \((X, X)\) triggers a sub-case (b) and \((L, U)\) triggers the related sub-case (a) then \(m_{\text {slope}}(X) = m'_{\text {slope}}(X, X) \le m'_{\text {slope}}(L, U)\). Indeed, given the tests in \(m'_{\text {slope}}\) and the inequalities \(\hbox {denom}(U, L) \le \hbox {denom}(X, X) \le \hbox {denom}(L, U)\) that were proven above, the sub-cases (a) always provide positive outputs, whereas the sub-cases (b) always provide negative (hence smaller) outputs.

Finally, when \((X, X)\) and \((L, U)\) trigger, in the definition of \(m'_{\text {slope}}\), not only the same case but also the same sub-case, \(m_{\text {slope}}(X) \le m'_{\text {slope}}(L, U)\) still holds. Indeed, the inequality \(\hbox {num}(U, L) \le \hbox {num}(X, X) \le \hbox {num}(L, U)\) and the inequality \(\hbox {denom}(U, L) \le \hbox {denom}(X, X) \le \hbox {denom}(L, U)\) together entail:

  • \(m_{\text {slope}}(X) = \frac{\text {num}(X, X)}{\text {denom}(X, X)} \le \frac{\text {num}(L, U)}{\text {denom}(U, L)}\) if the two numerators and the two denominators are positive, i.e., in case 1a;

  • \(m_{\text {slope}}(X) = \frac{\text {num}(X, X)}{\text {denom}(X, X)} \le \frac{\text {num}(L, U)}{\text {denom}(L, U)}\) if the two numerators are negative and the two denominators are positive, i.e., in case 1b;

  • \(m_{\text {slope}}(X) = \frac{\text {num}(X, X)}{\text {denom}(X, X)} \le \frac{\text {num}(U, L)}{\text {denom}(L, U)}\) if the two numerators and the two denominators are negative, i.e., in case 2a;

  • \(m_{\text {slope}}(X) = \frac{\text {num}(X, X)}{\text {denom}(X, X)} \le \frac{\text {num}(U, L)}{\text {denom}(U, L)}\) if the two numerators are positive and the two denominators are negative, i.e., in case 2b.

About this article

Cite this article

Nadisic, N., Coussat, A. & Cerf, L. Mining skypatterns in fuzzy tensors. Data Min Knowl Disc 33, 1298–1322 (2019). https://doi.org/10.1007/s10618-019-00640-4

Keywords

  • Pattern mining
  • Skypattern
  • Fuzzy tensor
  • Search space pruning