Data Mining and Knowledge Discovery, Volume 28, Issue 5–6, pp 1366–1397

Unsupervised interaction-preserving discretization of multivariate data

  • Hoang-Vu Nguyen
  • Emmanuel Müller
  • Jilles Vreeken
  • Klemens Böhm
Article

Abstract

Discretization is the transformation of continuous data into discrete bins. It is an important and general pre-processing technique, and a critical element of many data mining and data management tasks. The general goal is to obtain discrete data that retains as much of the information in the original continuous data as possible. A key open question, particularly for exploratory tasks, is how to discretize multivariate data such that significant associations and patterns are preserved. That is exactly the problem we study in this paper. We propose IPD, an information-theoretic method for unsupervised discretization that focuses on preserving multivariate interactions. To this end, when discretizing a dimension, we consider the distribution of the data over all other dimensions. In particular, our method examines consecutive multivariate regions and merges them if (a) their multivariate data distributions are statistically similar, and (b) the merge reduces the MDL encoding cost. To assess similarity, we propose \( ID \), a novel interaction distance that does not require assuming a distribution and can be computed in closed form. We give an efficient algorithm for finding the optimal bin merge, as well as a fast, well-performing heuristic. Empirical evaluation through pattern-based compression, outlier mining, and classification shows that by preserving interactions we consistently outperform the state of the art in both quality and speed.
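
The bottom-up merging idea described in the abstract can be illustrated with a small sketch. This is not the IPD algorithm itself: the abstract does not give the definition of \( ID \) or the MDL cost, so the sketch assumes a stand-in distance (the largest per-dimension Kolmogorov-Smirnov gap between empirical CDFs) and replaces the MDL criterion with a fixed threshold. The names `ks_distance`, `bin_distance`, `merge_bins`, `n_micro`, and `threshold` are hypothetical and chosen for illustration only.

```python
from bisect import bisect_right


def ks_distance(a, b):
    """Largest gap between the empirical CDFs of two 1-D samples.

    Stand-in only: this is NOT the interaction distance ID from the paper.
    """
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


def bin_distance(rows_a, rows_b, target):
    """Compare two bins on every dimension except the one being discretized.

    Assumes the data has at least two dimensions.
    """
    dims = [d for d in range(len(rows_a[0])) if d != target]
    return max(
        ks_distance([r[d] for r in rows_a], [r[d] for r in rows_b])
        for d in dims
    )


def merge_bins(rows, target, n_micro=8, threshold=0.5):
    """Greedily merge consecutive micro-bins along dimension `target`.

    A fixed `threshold` stands in for the MDL-based stopping rule.
    Returns the cut points that remain on the target dimension.
    """
    rows = sorted(rows, key=lambda r: r[target])
    size = max(1, len(rows) // n_micro)
    # Start from fine-grained, equal-frequency micro-bins.
    bins = [rows[i:i + size] for i in range(0, len(rows), size)]
    while len(bins) > 1:
        gaps = [
            bin_distance(bins[i], bins[i + 1], target)
            for i in range(len(bins) - 1)
        ]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > threshold:
            break  # the most similar neighbours still differ too much
        bins[i:i + 2] = [bins[i] + bins[i + 1]]
    # Place each cut point midway between neighbouring bins.
    return [
        (bins[i][-1][target] + bins[i + 1][0][target]) / 2
        for i in range(len(bins) - 1)
    ]


# Toy data: the second dimension jumps by 100 once the first reaches 20.
rows = [(x, (x % 5) + (100 if x >= 20 else 0)) for x in range(40)]
cuts = merge_bins(rows, target=0)
```

On this toy data, the only cut point that survives sits where the second dimension changes its distribution; all micro-bins on either side look alike and are merged. A faithful implementation would need the paper's definitions of \( ID \) and the encoding cost in place of the two stand-ins.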

Keywords

  • Discretization
  • Interaction preservation
  • Pattern mining
  • Outlier mining
  • Classification

Notes

Acknowledgments

We thank the anonymous reviewers for their insightful comments. Hoang-Vu Nguyen is supported by the German Research Foundation (DFG) within GRK 1194. Emmanuel Müller is supported by the YIG program of KIT as part of the German Excellence Initiative. Jilles Vreeken is supported by the Cluster of Excellence “Multimodal Computing and Interaction” within the Excellence Initiative of the German Federal Government. Emmanuel Müller and Jilles Vreeken are supported by Post-Doctoral Fellowships of the Research Foundation – Flanders (FWO).

Copyright information

© The Author(s) 2014

Authors and Affiliations

  • Hoang-Vu Nguyen (1)
  • Emmanuel Müller (1)
  • Jilles Vreeken (2, 3)
  • Klemens Böhm (1)
  1. Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
  2. Max-Planck Institute for Informatics, Saarbrücken, Germany
  3. Cluster of Excellence MMCI, Saarland University, Saarbrücken, Germany
