Abstract
Major challenges of clustering geo-referenced data include identifying arbitrarily shaped clusters, properly utilizing spatial information, coping with diverse extrinsic characteristics of clusters, and supporting region discovery tasks. The goal of region discovery is to identify interesting regions in geo-referenced datasets based on a domain expert’s notion of interestingness. Almost all agglomerative clustering algorithms focus only on the first challenge. The goal of the proposed work is to develop agglomerative clustering frameworks that deal with all four challenges. In particular, we propose a generic agglomerative clustering framework for geo-referenced datasets (GAC-GEO) that generalizes agglomerative clustering by allowing for three plug-in components. GAC-GEO agglomerates neighboring clusters so as to maximize a plug-in fitness function that captures the notion of interestingness of clusters. It enhances typical agglomerative clustering algorithms in two ways: fitness functions support task-specific clustering, whereas generic neighboring relationships increase the number of merging candidates. We also demonstrate that existing agglomerative clustering algorithms can be considered special cases of GAC-GEO. We evaluate the proposed framework on an artificial dataset and two real-world applications involving region discovery. The experimental results show that GAC-GEO is capable of identifying arbitrarily shaped hotspots for different data mining tasks.
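The merging scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract mentions three plug-in components but names only the fitness function and the neighboring relationship, so the sketch covers just those two and takes the initial clusters as given. The function names (`agglomerate`, `are_neighbors`, `fitness`), the toy data, the purity-based reward with exponent 1.5, and the rule of stopping when no merge improves fitness are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Set

Cluster = FrozenSet[int]  # a cluster is a set of object indices


def agglomerate(
    initial: Iterable[Cluster],
    are_neighbors: Callable[[Cluster, Cluster], bool],  # plug-in (hypothetical name)
    fitness: Callable[[Set[Cluster]], float],           # plug-in (hypothetical name)
) -> Set[Cluster]:
    """Greedily merge the neighboring pair that most increases fitness.

    Assumed stopping rule: stop when no neighboring merge improves fitness.
    """
    clusters = set(initial)
    while True:
        current = fitness(clusters)
        best_gain, best_merge = 0.0, None
        # Sort for a deterministic visiting order; ties keep the first pair seen.
        for a, b in combinations(sorted(clusters, key=sorted), 2):
            if not are_neighbors(a, b):
                continue  # only neighboring clusters are merge candidates
            candidate = (clusters - {a, b}) | {a | b}
            gain = fitness(candidate) - current
            if gain > best_gain:
                best_gain, best_merge = gain, candidate
        if best_merge is None:
            return clusters
        clusters = best_merge


# Toy 1-D data (assumed): positions on a line and a class label per object.
POSITIONS = [0.0, 1.0, 2.0, 10.0, 11.0]
LABELS = ["a", "a", "b", "b", "b"]


def are_neighbors(a: Cluster, b: Cluster) -> bool:
    # Assumed neighboring relationship: closest pair of objects within 1.5 units.
    return min(abs(POSITIONS[i] - POSITIONS[j]) for i in a for j in b) <= 1.5


def fitness(clusters: Set[Cluster]) -> float:
    # Assumed fitness: label-pure clusters earn size**1.5, impure clusters earn 0.
    return sum(len(c) ** 1.5 for c in clusters if len({LABELS[i] for i in c}) == 1)


result = agglomerate([frozenset({i}) for i in range(5)], are_neighbors, fitness)
print(sorted(sorted(c) for c in result))  # → [[0, 1], [2], [3, 4]]
```

In this toy run, objects 1 and 2 are spatial neighbors but are not merged because the merged cluster would be impure and lower the fitness, while object 2 and the cluster {3, 4} share a label but are not merged because they are not neighbors. This shows how the two plug-ins jointly decide which merges are considered and which are accepted.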
Cite this article
Jiamthapthaksin, R., Eick, C.F. & Lee, S. GAC-GEO: a generic agglomerative clustering framework for geo-referenced datasets. Knowl Inf Syst 29, 597–628 (2011). https://doi.org/10.1007/s10115-010-0355-3