
Knowledge and Information Systems, Volume 45, Issue 2, pp 319–355

Anytime density-based clustering of complex data

  • Son T. Mai (Email author)
  • Xiao He
  • Jing Feng
  • Claudia Plant
  • Christian Böhm
Regular Paper

Abstract

Many clustering algorithms scale poorly to massive datasets and do not support user interaction at runtime. Anytime clustering algorithms have been proposed to tackle these problems. They quickly produce an approximate result that is continuously refined as the run proceeds, and they can be stopped or suspended at any time to provide an intermediate answer. In this paper, we propose a novel anytime clustering algorithm modeled on the density-based clustering paradigm. Our algorithm, called A-DBSCAN, is applicable to many kinds of complex data, such as trajectory and medical data. The general idea is to use a sequence of lower bounding functions (LBs) of the true distance function to produce multiple approximate results of the true density-based clusters. A-DBSCAN operates in multiple levels w.r.t. the LBs and is mainly based on two algorithmic schemes: (1) an efficient distance upgrade scheme, which restricts distance calculations to core objects at each level of the LBs, and (2) a local reclustering scheme, which restricts update operations to the relevant objects only. To further improve performance, we propose a significantly extended version of A-DBSCAN, called A-DBSCAN-XS, which is built upon the anytime scheme of A-DBSCAN and the \(\mu\)-range query scheme of a data structure called the extended Xseedlist. A-DBSCAN-XS requires fewer distance calculations at each level than A-DBSCAN and is thus more efficient. Extensive experiments demonstrate that A-DBSCAN and A-DBSCAN-XS acquire very good clustering results at very early stages of execution and thus save a large amount of computational time. Even if they run to the end, A-DBSCAN and A-DBSCAN-XS are still orders of magnitude faster than the original DBSCAN algorithm and its variants. We also introduce a novel application of our algorithms: the segmentation of white matter fiber tracts in the human brain, which is an important tool for studying brain structure and various diseases such as Alzheimer's disease.
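
The lower-bounding idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the A-DBSCAN algorithm itself: it only shows why a sequence of lower bounds of a distance yields range-query results that shrink toward the true neighborhood. The lower bounds used here (Euclidean distance restricted to the first k coordinates) and all function names are assumptions chosen for illustration; the paper targets other distance functions and lower bounds.

```python
import math

def prefix_lb(p, q, k):
    """Euclidean distance on the first k coordinates only.

    Dropping coordinates can only remove non-negative terms from the sum,
    so this value never exceeds the full Euclidean distance (a lower bound).
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p[:k], q[:k])))

def range_query(data, i, eps, k):
    """Indices j whose level-k lower-bound distance to data[i] is at most eps."""
    return {j for j, q in enumerate(data) if prefix_lb(data[i], q, k) <= eps}

def neighborhoods_per_level(data, i, eps):
    """Range-query result at every level, from the coarsest bound to the true distance."""
    dim = len(data[i])
    return [range_query(data, i, eps, k) for k in range(1, dim + 1)]

# Hypothetical toy data: five 3-dimensional points.
points = [
    (0.0, 0.0, 0.0),
    (0.5, 0.1, 0.2),
    (0.4, 2.0, 0.0),
    (0.3, 0.2, 3.0),
    (5.0, 0.0, 0.0),
]

levels = neighborhoods_per_level(points, 0, eps=1.0)
for k, neighbors in enumerate(levels, start=1):
    print(f"level {k}: {sorted(neighbors)}")
```

Because each level's bound is at least as tight as the previous one, every neighborhood is a subset of the one before it, and the last level equals the true eps-neighborhood. An object that fails the coarse bound can therefore be discarded without ever computing the expensive true distance, which is the general filter-and-refine principle that multi-level schemes of this kind build on.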

Keywords

Anytime clustering, Density-based clustering, Lower bounding distance, Fiber segmentation, Fiber clustering, Diffusion tensor imaging

Notes

Acknowledgments

We thank Diep M. T. Phan, Ha H. T. Mai, Hanh M. T. Vo, Nhan M. T. Luong, Quan A. Tran, Ninh A. Nguyen, Anh X. Nghiem, Sebastian Goebl, Nina Hubig, and Franz Krojer for their help and support. Our special thanks go to Prof. Kai Zhang and Prof. Brian Kulis for kindly providing us with the source code of their papers. We especially thank the anonymous reviewers for their invaluable comments, which helped to significantly improve the quality of this paper.


Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • Son T. Mai (1), Email author
  • Xiao He (1)
  • Jing Feng (1)
  • Claudia Plant (2)
  • Christian Böhm (1)

  1. Institute for Informatics, University of Munich, Munich, Germany
  2. Helmholtz Zentrum München, Technische Universität München, Munich, Germany
