Encyclopedia of GIS

2017 Edition
Editors: Shashi Shekhar, Hui Xiong, Xun Zhou

Clustering of Geospatial Big Data in a Distributed Environment

  • Thomas Triplet
  • Samuel Foucher
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-17885-1_1625

Historical Background

Clustering, also called unsupervised learning or unsupervised classification (notably in machine learning and data mining) or exploratory data analysis, is one of the most fundamental steps in understanding a dataset. It aims to discover the unknown nature of the data by separating a finite dataset, with little or no ground truth, into a finite and discrete set of “natural,” hidden data structures. Given a set of n points in a two-dimensional space, the purpose of clustering is to group them into a number of sets based on similarity measures and distance vectors. Clustering is also useful for compression in large databases (Daschiel and Datcu 2005). It usually aims to create homogeneous groups that are maximally separable. It is a fundamental tool in Knowledge Discovery and Data...
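As an illustration of grouping n two-dimensional points by distance, the following is a minimal sketch of one classical technique, Lloyd-style k-means (Lloyd 1982; MacQueen 1967). The entry does not single out this algorithm; it is chosen here only as an example. The function name kmeans_2d, its parameters, and the sample points are hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import math
import random


def kmeans_2d(points, k, iterations=20, seed=0):
    """Group 2-D points into k clusters (illustrative Lloyd-style k-means)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (Euclidean distance is an assumption of this sketch).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids


# Hypothetical sample: two well-separated groups of three points each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels, centroids = kmeans_2d(points, k=2)
```

With well-separated groups such as these, the first three points end up sharing one label and the last three the other. The number of clusters k must be supplied by the caller, which is one reason other families of algorithms (density-based or grid-based, for instance) are often considered for spatial data.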


  1. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data (SIGMOD’98), New York. ACM, pp 94–105
  2. Alam S, Dobbie G, Koh YS, Riddle P, Rehman SU (2014) Research on particle swarm optimization based clustering: a systematic review of literature and techniques. Swarm Evol Comput 17(0):1–13
  3. Andrienko G (2008) Spatio-temporal aggregation for visual analysis of movements. In: Proceedings of IEEE symposium on visual analytics science and technology (VAST 2008), Columbus, pp 51–58
  4. Austwick MZ, O’Brien O, Strano E, Viana M (2013) The structure of spatial networks and communities in bicycle sharing systems. PLoS ONE 8(9):e74685
  5. Avvenuti M, Cresci S, Marchetti A, Meletti C, Tesconi M (2014) Ears (earthquake alert and report system): a real time decision support system for earthquake crisis management. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’14), New York. ACM, pp 1749–1758
  6. Brewer E (2012) Cap twelve years later: how the “rules” have changed. Computer 45(2):23–29
  7. Cattell R (2011) Scalable SQL and NoSQL data stores. SIGMOD Rec 39(4):12–27
  8. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2006) Bigtable: a distributed storage system for structured data. In: Proceedings of the 7th symposium on operating systems design and implementation (OSDI’06), Berkeley. USENIX Association, pp 205–218
  9. Chen X, Vo H, Aji A, Wang F (2014) High performance integrated spatial big data analytics. In: Proceedings of the 3rd ACM SIGSPATIAL international workshop on analytics for big geospatial data (BigSpatial’14), New York. ACM, pp 11–14
  10. Dai H-K, Su H-C (2003) Approximation and analytical studies of inter-clustering performances of space-filling curves. In: Banderier C, Krattenthaler C (eds) Discrete random walks (DRW’03), Paris, Sept 1–5 2003. Discrete mathematics and theoretical computer science proceedings, vol AC. DMTCS, pp 53–68
  11. Daschiel H, Datcu M (2005) Information mining in remote sensing image archives: system evaluation. IEEE Trans Geosci Remote Sens 43(1):188–199
  12. Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design & implementation (OSDI’04), vol 6, Berkeley. USENIX Association, pp 10–10
  13. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon’s highly available key-value store. In: Proceedings of twenty-first ACM SIGOPS symposium on operating systems principles (SOSP’07), New York. ACM, pp 205–220
  14. Ehrlich R, Bezdek JC, Full W (1984) Fcm: the fuzzy c-means clustering algorithm. Comput Geosci 10(2–3):191–203
  15. Eldawy A, Mokbel MF (2015) Spatialhadoop: a mapreduce framework for spatial data. In: Proceedings of the 31st IEEE international conference on data engineering (ICDE), Seoul
  16. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad UM (eds) Second international conference on knowledge discovery and data mining. AAAI Press, Palo Alto, pp 226–231
  17. Foth N (2010) Long-term change around skytrain stations in Vancouver, Canada: a demographic shift-share analysis. Geograph Bull 51:37–52
  18. Fox A, Eichelberger C, Hughes J, Lyon S (2013) Spatio-temporal indexing in non-relational distributed databases. In: 2013 IEEE international conference on big data, Santa Clara, pp 291–299
  19. Gahlot V, Swami BL, Parida M, Kalla P (2012) User oriented planning of bus rapid transit corridor in GIS environment. Int J Sustain Built Environ 1:102–109
  20. Gao H, Jiang J, She L, Fu Y (2010) A new agglomerative hierarchical clustering algorithm implementation based on the map reduce framework. J Digit Content Technol Appl 4(3):95–100
  21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: Proceedings of the 19th ACM symposium on operating systems principles (SOSP ’03), New York. ACM, pp 29–43
  22. Gilbert S, Lynch N (2002) Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33(2):51–59
  23. Gouineau F, Landry T, Triplet T (2016) PatchWork: a scalable density-grid clustering algorithm. In: Proceedings of the 31st ACM symposium on applied computing, data mining track, Pisa
  24. Hagenauer J, Helbich M (2013) Contextual neural gas for spatial clustering and analysis. Int J Geograph Inf Sci 27:251–266
  25. He Y, Tan H, Luo W, Feng S, Fan J (2014) MR-DBSCAN: a scalable mapreduce-based DBSCAN algorithm for heavily skewed data. Front Comput Sci 8(1):83–99
  26. Hinneburg A, Gabriel H-H (2007) Denclue 2.0: fast clustering based on kernel density estimation. In: Proceedings of the 7th international conference on intelligent data analysis (IDA’07). Springer, Berlin/Heidelberg, pp 70–80
  27. Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: Agrawal R, Stolorz PE, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining (KDD-98), New York, 27–31 Aug 1998. AAAI Press, pp 58–65
  28. Hong-bo X, Zhong-xiao H, Qi-Long H (2009) A clustering algorithm based on grid partition of space-filling curve. In: 2009 fourth international conference on internet computing for science and engineering (ICICSE), Harbin, pp 260–265
  29. Hruschka ER, Campello RJGB, Freitas AA, de Carvalho ACPLF (2009) A survey of evolutionary algorithms for clustering. IEEE Trans Syst Man Cybern Part C Appl Rev 39(2):133–155
  30. ISO (2004) Geographic information—simple feature access—Part 1: common architecture. ISO 19125–1:2004, International Organization for Standardization, Geneva
  31. ISO (2008) Geographic information—simple feature access—Part 2: SQL option. ISO 19125–2:2004, International Organization for Standardization, Geneva
  32. Jestes J, Yi K, Li F (2011) Building wavelet histograms on large data in mapreduce. Proc VLDB Endow 5(2):109–120
  33. Jin C, Patwary MMA, Agrawal A, Hendrix W, Liao W-k, Choudhary A (2013) Disc: a distributed single-linkage hierarchical clustering algorithm using mapreduce. In: Proceedings of the 4th international SC workshop on data intensive computing in the clouds, Denver. (http://datasys.cs.iit.edu/events/DataCloud2013/)
  34. Jin C, Liu R, Chen Z, Hendrix W, Agrawal A, Choudhary A (2015) A scalable hierarchical clustering algorithm using spark. In: IEEE first international conference on big data computing service and applications, Redwood City, pp 418–426
  35. Kanellakis PC, Kuper GM, Revesz P (1995) Constraint query languages. J Comput Syst Sci 51(1):26–52
  36. Kisilevich S, Mansmann F, Keim D (2010a) P-dbscan: a density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos. In: Proceedings of the 1st international conference and exhibition on computing for geospatial research & application (COM.Geo ’10), Washington, DC. ACM, Springer, pp 1–4. (http://www.springer.com/us/book/9780387098227)
  37. Kisilevich S, Mansmann F, Nanni M, Rinzivillo S (2010b) Spatio-temporal clustering. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, pp 855–874. http://www.springer.com/us/book/9780387098227
  38. Kuijpers B, Alvares LO, Palma AT, Bogorny V (2008) A clustering-based approach for discovering interesting places in trajectories. In: Proceedings of the 2008 ACM symposium on applied computing, Fortaleza, pp 863–868
  39. Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137
  40. Lv Z, Hu Y, Zhong H, Wu J, Li B, Zhao H (2010) Parallel k-means clustering of remote sensing images based on MapReduce. In: Proceedings of the 2010 international conference on web information systems and mining (WISM’10). Springer, Berlin/Heidelberg, pp 162–170
  41. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, Berkeley/Los Angeles
  42. Miller HJ (2010) The data avalanche is here. Shouldn’t we be digging? J Reg Sci 50:181–201
  43. Ng RT, Han J, Ieee Computer Society (2005) Clarans: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 1003–1017
  44. Noticewala M, Vaghela D (2014) Mr-idbscan: efficient parallel incremental dbscan algorithm using mapreduce. Int J Comput Appl 93(4):13–18
  45. Patwary MA, Palsetia D, Agrawal A, Liao W-k, Manne F, Choudhary A (2012) A new scalable parallel dbscan algorithm using the disjoint-set data structure. In: Proceedings of the international conference on high performance computing, networking, storage and analysis (SC’12), Los Alamitos. IEEE Computer Society Press, pp 62:1–62:11
  46. Sheikholeslami G, Chatterjee S, Zhang A (1998) Wavecluster: a multi-resolution clustering approach for very large spatial databases. Proc Int Confer Very Large Data Bases 24:428–439
  47. Stonebraker M (1986) The case for shared nothing. IEEE Database Eng Bull 9(1):4–9
  48. Wang W, Yang J, Muntz RR (1997) Sting: a statistical information grid approach to spatial data mining. In: Proceedings of the 23rd international conference on very large data bases (VLDB’97), San Francisco. Morgan Kaufmann Publishers Inc, pp 186–195
  49. Webber J (2012) A programmatic introduction to neo4j. In: Proceedings of the 3rd annual conference on systems, programming, and applications: software for humanity (SPLASH’12), New York. ACM, pp 217–218
  50. Wood J, O’Brien O, Slingsby A, Dykes J (2011) Visualizing the dynamics of London’s bicycle-hire scheme. Cartogr Int J Geograph Inf Geovis 46(4):239–251
  51. Xiaoyun C, Yi C, Xiaoli Q, Min Y, Yanshan H (2009) PGMCLU: a novel parallel grid-based clustering algorithm for multi-density datasets. In: 1st IEEE symposium on web society, 2009 (SWS’09), Lanzhou, pp 166–171
  52. Yu Y, Zhao J, Wang X, Wang Q, Zhang Y (2015) Cludoop: an efficient distributed density-based clustering for big data using Hadoop. Int J Distrib Sensor Netw 2015(2):1–13
  53. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing (HotCloud’10), Berkeley. USENIX Association, pp 10–10
  54. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation (NSDI’12), Berkeley. USENIX Association, pp 2–2
  55. Zhang H, Zhou Y, Li J, Wang X, Yan B (2010) Analyze the wild birds’ migration tracks by MPI-based parallel clustering algorithm. In: Proceedings of the 6th international conference on advanced data mining and applications: Part I (ADMA’10). Springer, Berlin/Heidelberg, pp 383–393

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Computer Research Institute of Montreal, Montreal, Canada