GAC-GEO: a generic agglomerative clustering framework for geo-referenced datasets

  • Regular Paper
Knowledge and Information Systems

Abstract

Major challenges of clustering geo-referenced data include identifying arbitrarily shaped clusters, properly utilizing spatial information, coping with diverse extrinsic characteristics of clusters, and supporting region discovery tasks. The goal of region discovery is to identify interesting regions in geo-referenced datasets based on a domain expert’s notion of interestingness. Almost all agglomerative clustering algorithms focus only on the first challenge. The goal of the proposed work is to develop agglomerative clustering frameworks that deal with all four challenges. In particular, we propose a generic agglomerative clustering framework for geo-referenced datasets (GAC-GEO) that generalizes agglomerative clustering by allowing for three plug-in components. GAC-GEO agglomerates neighboring clusters, maximizing a plug-in fitness function that captures the notion of interestingness of clusters. It enhances typical agglomerative clustering algorithms in two ways: fitness functions support task-specific clustering, whereas generic neighboring relationships increase the number of merging candidates. We also demonstrate that existing agglomerative clustering algorithms can be regarded as special cases of GAC-GEO. We evaluate the proposed framework on an artificial dataset and two real-world applications involving region discovery. The experimental results show that GAC-GEO is capable of identifying arbitrarily shaped hotspots for different data mining tasks.
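The merging idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract mentions three plug-in components but describes only two of them, so the sketch takes just a fitness function and a neighboring relation as parameters. The function name `agglomerate`, the greedy "merge the neighboring pair that most increases total fitness" rule, the stopping condition (stop when no merge improves the total), and the toy purity-based fitness are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List

Cluster = FrozenSet[int]  # a cluster is a set of object indices


def agglomerate(
    initial: List[Cluster],
    are_neighbors: Callable[[Cluster, Cluster], bool],  # plug-in: neighboring relation
    fitness: Callable[[Cluster], float],                # plug-in: per-cluster interestingness
) -> List[Cluster]:
    """Greedily merge neighboring clusters while the summed fitness increases.

    Hypothetical sketch: the stopping rule and the additive total are assumptions.
    """
    clusters = list(initial)
    while len(clusters) > 1:
        current = sum(fitness(c) for c in clusters)
        best_total, best_pair = current, None
        for a, b in combinations(clusters, 2):
            if not are_neighbors(a, b):
                continue  # only neighboring clusters are merge candidates
            total = current - fitness(a) - fitness(b) + fitness(a | b)
            if total > best_total:
                best_total, best_pair = total, (a, b)
        if best_pair is None:
            break  # no neighboring merge improves the total fitness
        a, b = best_pair
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Toy data (invented for this example): 1-D positions with a class label each.
xs = [0.0, 1.0, 2.0, 3.0, 10.0]
labels = ["a", "a", "b", "b", "a"]


def are_neighbors(a: Cluster, b: Cluster) -> bool:
    # Assumed relation: clusters are neighbors if their closest members are within 1.5.
    return min(abs(xs[i] - xs[j]) for i in a for j in b) <= 1.5


def purity_fitness(c: Cluster) -> float:
    # Assumed fitness: reward large clusters that contain a single class only.
    return float(len(c) ** 2) if len({labels[i] for i in c}) == 1 else 0.0


result = agglomerate(
    [frozenset([i]) for i in range(len(xs))], are_neighbors, purity_fitness
)
print(sorted(sorted(c) for c in result))
```

Swapping `purity_fitness` for a different function changes what counts as an interesting region without touching the merge loop, and a more permissive `are_neighbors` enlarges the set of merge candidates; these are the two degrees of freedom the abstract highlights.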

Author information

Corresponding author

Correspondence to Rachsuda Jiamthapthaksin.

About this article

Cite this article

Jiamthapthaksin, R., Eick, C.F. & Lee, S. GAC-GEO: a generic agglomerative clustering framework for geo-referenced datasets. Knowl Inf Syst 29, 597–628 (2011). https://doi.org/10.1007/s10115-010-0355-3

  • DOI: https://doi.org/10.1007/s10115-010-0355-3
