Knowledge and Information Systems, Volume 35, Issue 3, pp 493–524

DEMass: a new density estimator for big data

  • Kai Ming Ting
  • Takashi Washio
  • Jonathan R. Wells
  • Fei Tony Liu
  • Sunil Aryal
Regular Paper

Abstract

Density estimation is the ubiquitous base modelling mechanism employed for many tasks, including clustering, classification, anomaly detection and information retrieval. Commonly used density estimation methods such as the kernel density estimator and the \(k\)-nearest neighbour density estimator have high time and space complexities, which render them inapplicable to problems with big data. This weakness sets a fundamental limit on existing algorithms for all of these tasks. We propose the first density estimation method with average-case sub-linear time complexity and constant space complexity in the number of instances, which stretches this fundamental limit to the extent that millions of instances can now be processed easily and quickly. We provide an asymptotic analysis of the new density estimator and verify the generality of the method by replacing the existing density estimators with the new one in three current density-based algorithms, namely DBSCAN, LOF and Bayesian classifiers, representing the three data mining tasks of clustering, anomaly detection and classification. Our empirical evaluation shows that the new density estimation method significantly improves their time and space complexities while maintaining or improving their task-specific performance in clustering, anomaly detection and classification. The new method empowers these algorithms, currently limited to small data sizes, to process big data, setting a new benchmark for what density-based algorithms can achieve.
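To make the complexity contrast concrete, the sketch below compares a standard Gaussian kernel density estimate, which must touch all \(n\) stored instances at every query, with a loose illustration of the mass-based alternative: density is read off as the count of training points falling in a small tree-defined region divided by that region's volume, averaged over an ensemble of trees built from small subsamples. This is only a minimal sketch of the general idea under stated assumptions; the function names, the midpoint-split tree and the parameter values (psi, t, h) are illustrative choices, not the authors' DEMass algorithm.

```python
import numpy as np

def kde_density(x, data, bandwidth=1.0):
    """Gaussian kernel density estimate at x.

    Every query scans all n stored instances: O(n*d) time per query and
    O(n*d) memory, which is the cost the paper sets out to avoid.
    """
    n, d = data.shape
    diffs = (data - x) / bandwidth
    norm = (2.0 * np.pi) ** (d / 2.0) * bandwidth ** d
    return np.mean(np.exp(-0.5 * np.sum(diffs ** 2, axis=1)) / norm)

def build_tree(sample, lo, hi, depth, rng):
    """Hypothetical region tree: split the box [lo, hi] at the midpoint of a
    random attribute, recording how many subsample points fall in each half.
    Illustrative only, not the authors' exact construction."""
    if depth == 0 or len(sample) <= 1:
        return {"mass": len(sample), "volume": float(np.prod(hi - lo))}
    q = rng.integers(len(lo))
    mid = (lo[q] + hi[q]) / 2.0
    left_hi, right_lo = hi.copy(), lo.copy()
    left_hi[q], right_lo[q] = mid, mid
    return {"q": q, "mid": mid,
            "left": build_tree(sample[sample[:, q] < mid], lo, left_hi, depth - 1, rng),
            "right": build_tree(sample[sample[:, q] >= mid], right_lo, hi, depth - 1, rng)}

def tree_density(node, x, subsample_size):
    """Mass of the leaf region containing x, normalised by region volume."""
    if "mass" in node:
        return node["mass"] / (subsample_size * node["volume"])
    child = node["left"] if x[node["q"]] < node["mid"] else node["right"]
    return tree_density(child, x, subsample_size)

# Usage: the ensemble is built once from small subsamples (psi instances per
# tree), so the per-query cost depends on tree depth h and ensemble size t,
# not on the total data size n.
rng = np.random.default_rng(0)
data = rng.normal(size=(100_000, 2))
x = np.zeros(2)
print(kde_density(x, data, bandwidth=0.5))          # scans all 100,000 points

psi, t, h = 256, 25, 8                              # illustrative parameter values
lo, hi = data.min(axis=0), data.max(axis=0)
trees = [build_tree(data[rng.choice(len(data), psi, replace=False)],
                    lo.copy(), hi.copy(), h, rng) for _ in range(t)]
print(np.mean([tree_density(tr, x, psi) for tr in trees]))
```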

Keywords

Density estimation · Density-based algorithms

References

  1. Achtert E, Kriegel H-P, Zimek A (2008) ELKI: a software system for evaluation of subspace clustering algorithms. In: Proceedings of the 20th international conference on scientific and statistical database management, pp 580–585
  2. Angiulli F, Fassetti F (2009) DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets. ACM Trans Knowl Discov Data 3(1):4:1–4:57
  3. Bay SD, Schwabacher M (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 29–38
  4. Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is "nearest neighbor" meaningful? In: Proceedings of the 7th international conference on database theory, pp 217–235
  5. Beygelzimer A, Kakade S, Langford J (2006) Cover trees for nearest neighbor. In: Proceedings of the 23rd international conference on machine learning, pp 97–104
  6. Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of ACM SIGMOD international conference on management of data, pp 93–104
  7. Catlett J (1991) On changing continuous attributes into ordered discrete attributes. In: Proceedings of the European working session on learning, pp 164–178
  8. Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd international conference on very large data bases, pp 426–435
  9. Deegalla S, Bostrom H (2006) Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification. In: Proceedings of the 5th international conference on machine learning and applications, IEEE Computer Society, Washington, pp 245–250
  10. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc Ser B 39(1):1–38
  11. Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Proceedings of the 12th international conference on machine learning, Morgan Kaufmann, pp 194–202
  12. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, AAAI Press, pp 226–231
  13. Fayyad UM, Irani KB (1995) Multi-interval discretization of continuous valued attributes for classification learning. In: Proceedings of the 14th international joint conference on artificial intelligence, pp 1034–1040
  14. Frank A, Asuncion A (2010) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. URL: http://archive.ics.uci.edu/ml
  15. Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29:131–163
  16. Hastie T, Tibshirani R, Friedman J (2001) Chapter 8.5: The EM algorithm. In: The elements of statistical learning, pp 236–243
  17. Hido S, Tsuboi Y, Kashima H, Sugiyama M, Kanamori T (2011) Statistical outlier detection using direct density ratio estimation. Knowl Inf Syst 26(2):309–336
  18. Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: Proceedings of the 26th international conference on very large data bases, pp 506–515
  19. Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of KDD, AAAI Press, pp 58–65
  20. Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mapping into Hilbert space. In: Proceedings of conference in modern analysis and probability, contemporary mathematics, vol 26. American Mathematical Society, pp 189–206
  21. Langley P, Iba W, Thompson K (1992) An analysis of Bayesian classifiers. In: Proceedings of the tenth national conference on artificial intelligence, pp 399–406
  22. Langley P, John GH (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence
  23. Lazarevic A, Kumar V (2005) Feature bagging for outlier detection. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 157–166
  24. Liu FT, Ting KM, Zhou Z-H (2010) On detecting clustered anomalies using SCiForest. In: Proceedings of ECML PKDD, pp 274–290
  25. Liu FT, Ting KM, Zhou Z-H (2012) Isolation-based anomaly detection. ACM Trans Knowl Discov Data 6(1):3:1–3:39
  26. Nanopoulos A, Theodoridis Y, Manolopoulos Y (2006) Indexed-based density biased sampling for clustering applications. Data Knowl Eng 57(1):37–63
  27. Rocke DM, Woodruff DL (1996) Identification of outliers in multivariate data. J Am Stat Assoc 91(435):1047–1061
  28. Schölkopf B, Platt JC, Shawe-Taylor JC, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471
  29. Silverman BW (1986) Density estimation for statistics and data analysis. Chapman & Hall, London
  30. Tan P-N, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, Reading
  31. Tan SC, Ting KM, Liu FT (2011) Fast anomaly detection for streaming data. In: Proceedings of IJCAI, pp 1151–1156
  32. Tax DMJ, Duin RPW (2004) Support vector data description. Mach Learn 54(1):45–66
  33. Ting KM, Washio T, Wells JR, Liu FT (2011) Density estimation based on mass. In: Proceedings of the 2011 IEEE 11th international conference on data mining, IEEE Computer Society, pp 715–724
  34. Ting KM, Wells JR (2010) Multi-dimensional mass estimation and mass-based clustering. In: Proceedings of IEEE international conference on data mining, pp 511–520
  35. Ting KM, Zhou G-T, Liu FT, Tan SC (2012) Mass estimation. Mach Learn, pp 1–34. doi:10.1007/s10994-012-5303-x
  36. Vapnik VN (2000) The nature of statistical learning theory, 2nd edn. Springer, Berlin
  37. Vries TD, Chawla S, Houle M (2012) Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst 32:25–52
  38. Webb GI, Boughton JR, Wang Z (2005) Aggregating one-dependence estimators. Mach Learn 58:5–24
  39. Witten IH, Frank E, Hall MA (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, San Francisco
  40. Yamanishi K, Takeuchi J-I, Williams G, Milne P (2000) On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining, pp 320–324

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Kai Ming Ting (1)
  • Takashi Washio (2)
  • Jonathan R. Wells (1)
  • Fei Tony Liu (1)
  • Sunil Aryal (1)

  1. Gippsland School of Information Technology, Monash University, Churchill, Australia
  2. The Institute of Scientific and Industrial Research, Osaka University, Osaka, Japan
