Skip to main content

Advertisement

Log in

Anytime density-based clustering of complex data

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Many clustering algorithms suffer from scalability problems on massive datasets and do not support any user interaction during runtime. To tackle these problems, anytime clustering algorithms are proposed. They produce a fast approximate result which is continuously refined during the further run. Also, they can be stopped or suspended anytime to provide an intermediate answer. In this paper, we propose a novel anytime clustering algorithm modeled on the density-based clustering paradigm. Our algorithm called A-DBSCAN is applicable to many complex data such as trajectory and medical data. The general idea of our algorithm is to use a sequence of lower bounding functions (LBs) of the true distance function to produce multiple approximate results of the true density-based clusters. A-DBSCAN operates in multiple levels w.r.t. the LBs and is mainly based on two algorithmic schemes: (1) an efficient distance upgrade scheme which restricts distance calculations to core objects at each level of the LBs and (2) a local reclustering scheme which restricts update operations to the relevant objects only. To further improve the performance, we propose a significant extension version of A-DBSCAN called A-DBSCAN-XS which is built upon the anytime scheme of A-DBSCAN and the \(\mu \)-range query scheme of a data structure called extended Xseedlist. A-DBSCAN-XS requires less distance calculations at each level than A-DBSCAN and thus is more efficient. Extensive experiments demonstrate that A-DBSCAN and A-DBSCAN-XS acquire very good clustering results at very early stages of execution and thus save a large amount of computational time. Even if they run to the end, A-DBSCAN and A-DBSCAN-XS are still orders of magnitude faster than the original algorithm DBSCAN and its variants. We also introduce a novel application for our algorithms for the segmentation of the white matter fiber tracts in human brain which is an important tool for studying the brain structure and various diseases such as Alzheimer.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21

Similar content being viewed by others

Notes

  1. http://www.cs.ucr.edu/~eamonn/time_series_data/.

  2. http://www.cs.ucr.edu/~eamonn/time_series_data/.

  3. http://archive.ics.uci.edu/ml.

  4. http://www1.cs.columbia.edu/CAVE/software/softlib/coil-20.php.

References

  1. Ueno K, Xi X, Keogh E, Lee D (2006) Anytime classification using the nearest neighbor algorithm with applications to stream mining. In: ICDM, pp 623–632

  2. Zhu Q, Batista G, Rakthanmanon T, Keogh E (2012) A novel approximation to dynamic time warping allows anytime clustering of massive time series datasets. In: SDM, pp 999–1010

  3. Zilberstein S, Russell SJ (1995) Approximate reasoning using anytime algorithms. In: Natarajan S (ed) Imprecise and approximate computation. Kluwer Academic Publishers. http://rbr.cs.umass.edu/shlomo/papers/ZRchapter95.html

  4. Seidl T, Assent I, Kranen P, Krieger R, Herrmann J (2009) Indexing density models for incremental learning and anytime classification on data streams. In: EDBT, pp 311–322

  5. Seidl T, Assent I, Kranen P, Krieger R, Herrmann J (2009) Indexing density models for incremental learning and anytime classification on data streams. In: EDBT, pp 311–322

  6. Assent I, Kranen P, Baldauf C, Seidl T (2012) AnyOut: anytime outlier detection on streaming data. In: DASFAA (1), pp 228–242

  7. Assent I, Kranen P, Baldauf C, Seidl T (2010) Detecting outliers on arbitrary data streams using anytime approaches. In: StreamKDD. ACM, New York, NY, USA, pp 10–16

  8. Lin J, Vlachos M, Keogh E, Gunopulos D (2004) Iterative incremental clustering of time series. In: EDBT, pp 106–122

  9. Lin J, Vlachos M, Keogh E, Gunopulos D, Liu J, Yu S, Le J (2005) A MPAA-based iterative clustering algorithm augmented by nearest neighbors search for time-series data streams. In: PAKDD, pp 333–342

  10. Ester M, Kriegel H, Sander J, Wimmer M, Xu X (1998) Incremental clustering for mining in a data warehousing environment. In: VLDB, pp 323–333

  11. Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp 226–231

  12. Ankerst M, Breunig MM, Kriegel H, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: SIGMOD, pp 49–60

  13. Chan K, Fu AW (1999) Efficient time series matching by wavelets. In: ICDE, pp 126–133

  14. Keogh EJ (2002) Exact indexing of dynamic time warping. In: VLDB, pp 406–417

  15. Sakurai Y, Yoshikawa M, Faloutsos C (2005) FTW: fast similarity search under the time warping distance. In: PODS. ACM, New York, NY, USA, pp 326–337

  16. Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh E (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. Proc VLDB Endow 1:1542–1552

    Article  Google Scholar 

  17. Brecheisen S, Kriegel H, Pfeif1e M (2004) Efficient density-based clustering of complex objects. In: ICDM, pp 43–50

  18. Brecheisen S, Kriegel H, Pfeifle M (2006) Parallel density-based clustering of complex objects. In: PAKDD, pp 179–188

  19. Kröger P, Kriegel H, Kailing K (2004) Density-connected subspace clustering for high-dimensional data. In: SDM, pp 246–256

  20. Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: ICML, pp 1073–1080

  21. Mai ST, He X, Feng J, Böhm C (2013) Efficient anytime density-based clustering. In: SDM, pp 112–120

  22. Morris BT, Trivedi MM (2009) Learning trajectory patterns by clustering: experimental studies and comparative evaluation. In: CVPR, pp 312–319

  23. Dom BE (2001) An information-theoretic external cluster-validity measure. Technical Report RJ 10219, IBM

  24. Lee J, Han J (2007) Trajectory clustering: a partition-and-group framework. In: SIGMOD, pp 593–604

  25. Zhang K, Kwok JT (2009) Density-weighted Nyström method for computing large kernel eigensystems. Neural Comput 21(1):121–146

    Article  MATH  MathSciNet  Google Scholar 

  26. Shang F, Jiao LC, Shi J, Gong M, Shang R (2011) Fast density-weighted low-rank approximation spectral clustering. Data Min Knowl Discov 23(2):345–378

    Article  MATH  MathSciNet  Google Scholar 

  27. Dhillon IS, Guan Y, Kulis B (2007) Weighted graph cuts without eigenvectors a multilevel approach. IEEE Trans Pattern Anal Mach Intell 29(11):1944–1957

    Article  Google Scholar 

  28. Mai ST (2014) Density-based algorithms for active and anytime clustering. PhD thesis, University of Munich

  29. Kranen P, Assent I, Baldauf C, Seidl T (2009) Self-adaptive anytime stream clustering. In: ICDM, pp 249–258

  30. Mai ST, He X, Hubig N, Plant C, Böhm C (2013) Active density-based clustering. In: ICDM, pp 508–517

  31. Mai ST, Goebl S, Plant C (2012) A similarity model and segmentation algorithm for white matter fiber tracts. In: ICDM, pp 1014–1019

  32. Mai ST (2013) Density-based clustering: a comprehensive survey. University of Munich, Technical report

  33. Kriegel H, Kröger P, Sander J, Zimek A (2011) Density-based clustering. Data Mining Knowl Discov 1(3):231–240

    Article  Google Scholar 

  34. Sander J, Ester M, Kriegel H, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2(2):169–194

    Article  Google Scholar 

  35. Mori S (2007) Introduction to diffusion tensor imaging. Elsevier, Amsterdam

    Google Scholar 

  36. Basser PJ, Pajevic S, Pierpaoli C, Duda J, Aldroubi A (2000) In vivo fiber tractography using DT-MRI data. Magn Reson Med 44:625–632

    Article  Google Scholar 

  37. Catani M, Howard RJ, Pajevic S, Jones DK (2002) Virtual in vivo interactive dissection of white matter fasciculi in the human brain. Neuroimage 17(1):77–94

    Article  Google Scholar 

  38. Brun A, Park HJ, Knutsson H, Westin CF (2003) Coloring of DT-MRI fiber traces using Laplacian eigenmaps. In: EUROCAST, pp 564–572

  39. Ding Z, Gore JC, Anderson AW (2003) Classification and quantification of neuronal fiber pathways using diffusion tensor MRI. Mag Res Med 49:716–721

    Article  Google Scholar 

  40. Corouge I, Gerig G, Gouttard S (2004) Towards a shape model of white matter fiber bundles using diffusion tensor MRI. In: ISBI, pp 344–347

  41. Tsai A, Westin CF, Hero AO, Willsky AS (2007) Fiber tract clustering on manifolds with dual rooted-graphs. In: CVPR

  42. Chen W, Ding Z, Zhang S, MacKay-Brandt A, Correia S, Qu H, Crow JA, Tate DF, Yan Z, Peng Q (2009) A novel interface for interactive exploration of DTI fibers. IEEE Trans Vis. Comput Graph 15(6):1433–1440

    Article  Google Scholar 

  43. Maddah M, Eric W, Grimson L, Warfield SK (2006) Statistical modeling and EM clustering of white matter fiber tracts. In: ISBI, pp 53–56

  44. Böhm C, Feng J, He X, Mai ST, Plant C, Shao J (2011) A novel similarity measure for fiber clustering using longest common subsequence. In: KDD-DMMH, pp 1–9

  45. Sherbondy A, Akers D, Mackenzie R, Dougherty R, Wandell B (2005) Exploring connectivity of the brain’s white matter with dynamic queries. IEEE Trans Vis Comput Graph 11(4):419–430

    Article  Google Scholar 

  46. Wang Q, Yap PT, Jia H, Wu G, Shen D (2010) Hierarchical fiber clustering based on multi-scale neuroanatomical features. In: MIAR, pp 448–456

  47. Zhu Y, Shasha D (2003) Warping indexes with envelope transforms for query by humming. In: SIGMOD, pp 181–192

  48. Yi B, Faloutsos C (2000) Fast time sequence indexing for arbitrary Lp norms. In: VLDB, pp 385–394

  49. Kim S, Park S, Chu WW (2001) An index-based approach for similarity search supporting time warping in large sequence databases. In: ICDE, pp 607–614

Download references

Acknowledgments

We thank Diep M. T. Phan, Ha H. T. Mai, Hanh M. T. Vo, Nhan M. T. Luong, Quan A. Tran, Ninh A. Nguyen, Anh X. Nghiem, Sebastian Goebl, Nina Hubig, and Franz Krojer for their helps and supports. Our special thanks to Prof. Kai Zhang and Prof. Brian Kulis for kindly providing us the source codes of their papers. We special thank anonymous reviewers for their invaluable comments which help to significantly improve the quality of this paper.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Son T. Mai.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Mai, S.T., He, X., Feng, J. et al. Anytime density-based clustering of complex data. Knowl Inf Syst 45, 319–355 (2015). https://doi.org/10.1007/s10115-014-0797-0

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-014-0797-0

Keywords

Navigation