3D Crisp Clustering of Geo-Urban Data
Synonyms
Definition

Partitional algorithms: These algorithms divide the data into a set of disjoint categories. They attempt to determine a given number of partitions that optimizes a certain criterion function. This optimization is an iterative procedure.

Hierarchical algorithms: These algorithms build clusters step by step, either by repeatedly merging smaller clusters into larger ones or by splitting a cluster into several smaller ones.

Density-based algorithms: These algorithms generate clusters according to a density function and can produce arbitrarily shaped clusters.

Grid-based algorithms: These algorithms are widely used in spatial data mining. The search space is quantized into a finite number of cells.
Historical Background
Crisp clustering algorithms have been used in many fields, such as web mining, spatial data analysis, business, and group-based prediction. Over the years, a number of algorithms have been proposed for various applications. They can be grouped by category as follows.
Partitional Algorithms

k-means
 k-means is the most widely used crisp clustering algorithm in applications such as machine learning, statistical analysis, and computer visualization. It was introduced by MacQueen to deal with the problem of data clustering (MacQueen 1967). The aim of this clustering technique is to optimize the objective function, which can be described as follows:
 $$\displaystyle{ E =\sum _{i=1}^{c}\sum _{x\in C_{i}}d(x,m_{i}) }$$(1)
 In Eq. (1), m_{ i } is the center of cluster C_{ i }, while d is the distance from point x to the center m_{ i }. Minimizing the criterion function E minimizes the distance between each point and its cluster center. A set of c cluster centers is chosen at the initial step. Then, each object is assigned to the nearest cluster center. The centers are then recomputed, and the process continues until the cluster centers stop changing.
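The iteration described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes that d is the Euclidean distance, that the initial centers are drawn at random from the data, and that each center is recomputed as the mean of its cluster. The names `assign` and `kmeans` are hypothetical.

```python
import math
import random

def assign(points, centers):
    """Assignment step: put each point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, c, max_iter=100, seed=0):
    """Alternate assignment and center updates until the centers stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, c)  # initial centers (assumption: random data points)
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # Update step: move each center to the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    clusters = assign(points, centers)
    # Criterion E of Eq. (1): total distance between points and their centers.
    energy = sum(math.dist(p, centers[i]) for i, cl in enumerate(clusters) for p in cl)
    return centers, clusters, energy

if __name__ == "__main__":
    data = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10), (10, 11, 10)]
    centers, clusters, energy = kmeans(data, 2)
    print(centers, energy)
```

The result depends on the initial centers, which is why the sketch exposes a `seed` parameter.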

PAM (Partitioning Around Medoids)
 This algorithm attempts to find a medoid for each cluster. A medoid is the object of a cluster whose average dissimilarity to the other objects of that cluster is minimal. PAM first computes k representative objects, the medoids. After the medoids are found, each object is grouped with its nearest medoid: object i is placed in cluster P_{ i } when medoid m_{P_{ i }} is nearer than any other medoid, that is,
 $$\displaystyle{ d(i,m_{P_{i}})\leq d(i,m_{j})\ \text{for all}\ j = 1,\ldots, k }$$(2)
 The k medoids are expected to minimize the objective function of PAM, which is described as follows:
 $$\displaystyle{ \sum _{i}d(i,m_{P_{i}}) }$$(3)
 According to Ng and Han (1994), PAM is expensive in finding the medoids. This is because it keeps exchanging a medoid with other objects until all of the selected objects qualify as medoids.
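The medoid search can be illustrated with a small sketch. It is a simplified swap loop under stated assumptions, not the published PAM procedure: the first k objects serve as initial medoids, the dissimilarity is Euclidean distance, and any swap that lowers the objective of Eq. (3) is accepted immediately. The names `total_cost` and `pam` are hypothetical.

```python
import math
from itertools import product

def total_cost(points, medoids):
    """Eq. (3): sum of the distances from each object to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Swap medoids with non-medoids while a swap lowers the total cost."""
    medoids = list(points[:k])  # naive initialization (assumption of this sketch)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i, candidate in product(range(k), points):
            if candidate in medoids:
                continue
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best:
                medoids, best, improved = trial, cost, True
    # Eq. (2): each object joins the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda j: math.dist(p, medoids[j])) for p in points]
    return medoids, labels, best

if __name__ == "__main__":
    data = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10), (10, 11, 10)]
    print(pam(data, 2))
```

Every trial swap re-evaluates the cost over all objects, which reflects why the search for medoids is expensive on large datasets.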

CLARA (Clustering Large Applications)

CLARA uses PAM as part of its technique. It draws multiple samples from the dataset and applies PAM to each sample.

CLARANS (Clustering Large Applications based on Randomized Search)

CLARANS builds on PAM and treats clustering as a search through a graph in which each node is a potential solution, that is, a set of k medoids. Replacing one medoid of a node produces a neighboring node, so the resulting clusters are neighbors of the existing clustering. From the current node, CLARANS selects neighbors at random and compares them with it, up to a user-defined number of neighbors. When a better candidate is found, CLARANS moves to that neighbor and restarts the process from there. If none is found, the current node is taken as a local optimum, and a new node is selected at random to search for another local optimum.
Hierarchical Algorithms

Agglomerative algorithms: These algorithms produce a decreasing number of clusters at each step. The two nearest clusters are merged at each step, producing a sequence of clustering schemes.

Divisive algorithms: In contrast to agglomerative algorithms, these algorithms produce an increasing number of clusters at each step. A cluster is split into two at each step, producing a sequence of clustering schemes.
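The agglomerative case can be sketched in a few lines. The example assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance; both are illustrative choices, and the names `single_link` and `agglomerate` are hypothetical.

```python
import math

def single_link(a, b):
    """Distance between two clusters: the closest pair of members."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, target_k=1):
    """Merge the two nearest clusters repeatedly and record every scheme."""
    clusters = [[p] for p in points]
    schemes = [[list(cl) for cl in clusters]]
    while len(clusters) > target_k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        schemes.append([list(cl) for cl in clusters])
    return schemes

if __name__ == "__main__":
    data = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (11, 0, 0), (50, 0, 0)]
    for scheme in agglomerate(data):
        print(len(scheme), scheme)
```

Each recorded scheme has one cluster fewer than the previous one, which is the decreasing sequence described above. A divisive algorithm would traverse a comparable sequence in the opposite direction.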

BIRCH

BIRCH (Zhang et al. 1996) uses a CF-tree as a hierarchical structure to partition a point dataset. It is also reported to be the first clustering algorithm that could handle noise effectively.

CURE

CURE (Guha et al. 1998) selects representative points from the data and then shrinks them toward the cluster center. To cater for large-volume applications such as large databases, CURE combines random sampling with partition clustering.
Density-Based Algorithms

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

In the DBSCAN algorithm (Ester et al. 1996), the neighborhood of a given radius around each core point of a cluster must contain at least a minimum number of points. This algorithm handles noise and outliers effectively. DBSCAN is also used as the base algorithm for incremental clustering, in which objects can be inserted into or deleted from an existing clustering efficiently.
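The density criterion can be made concrete with a compact sketch. It is a brute-force illustration, not the indexed implementation of the original paper: neighborhoods are found by scanning all points, the distance is Euclidean, and a point counts as its own neighbor. The parameter names `eps` and `min_pts` are assumptions of this sketch.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    noise, unseen = -1, None
    labels = [unseen] * len(points)

    def neighbors(i):
        # Brute-force radius query; the point itself is included.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unseen:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unseen:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(reach)
    return labels

if __name__ == "__main__":
    data = [(0, 0, 0), (0.5, 0, 0), (1, 0, 0),
            (10, 10, 10), (10.5, 10, 10), (11, 10, 10), (50, 50, 50)]
    print(dbscan(data, eps=0.6, min_pts=2))
```

In the small dataset above, the isolated point is labeled as noise while the two dense groups become separate clusters.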

DENCLUE (Density-Based Clustering)

DENCLUE was proposed by Hinneburg and Keim (1998) for clustering large databases such as multimedia databases. The algorithm models the point density analytically. Clusters are then identified by determining the density attractors.
Grid-Based Algorithms

STING (Statistical Information Grid-based method)

STING was proposed by Wang et al. (1997). It divides the space or region into rectangular cells organized in a hierarchical structure. Statistical parameters (e.g., min, max, and mean) summarize the numerical features of the objects in each cell. Clustering information is then represented through the hierarchical structure of the grid cells. This clustering approach makes search queries efficient.
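A rough sketch of the cell statistics is given below. It is not the STING algorithm itself: it only shows how points can be quantized into cubic cells, how per-cell parameters can be computed for one numerical attribute, and how child-cell summaries can be combined into a parent-cell summary without revisiting the points. The names `cell_stats` and `parent_stats` and the uniform `cell_size` are assumptions.

```python
import math
from collections import defaultdict

def cell_stats(points, values, cell_size):
    """Group attribute values by grid cell and summarize each cell."""
    buckets = defaultdict(list)
    for p, v in zip(points, values):
        key = tuple(math.floor(c / cell_size) for c in p)  # integer cell index
        buckets[key].append(v)
    return {
        key: {"n": len(vs), "min": min(vs), "max": max(vs), "mean": sum(vs) / len(vs)}
        for key, vs in buckets.items()
    }

def parent_stats(children):
    """Combine child-cell summaries into the summary of their parent cell."""
    n = sum(ch["n"] for ch in children)
    return {
        "n": n,
        "min": min(ch["min"] for ch in children),
        "max": max(ch["max"] for ch in children),
        "mean": sum(ch["mean"] * ch["n"] for ch in children) / n,
    }

if __name__ == "__main__":
    pts = [(0.5, 0.5, 0.5), (0.2, 0.9, 0.1), (1.5, 0.5, 0.5)]
    heights = [10.0, 20.0, 40.0]
    stats = cell_stats(pts, heights, cell_size=1.0)
    print(stats)
    print(parent_stats(list(stats.values())))
```

Because the parent summary is derived from the child summaries alone, a query can be answered at a coarse level and refined only where needed.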

WaveCluster

WaveCluster was introduced by Sheikholeslami et al. (2000). The algorithm draws on signal processing and the frequency domain. It starts by imposing a multidimensional grid structure onto the space. The information held in the grid cells is then transformed using a wavelet transform. Clusters are found by identifying dense regions in the transformed domain.
Scientific Fundamentals
3D geospatial data are expected to become the core of spatial data in the near future. This is due to the increasing demand for 3D geospatial applications and to state-of-the-art 3D spatial data capture techniques such as LiDAR (light detection and ranging), UAV (unmanned aerial vehicle), and TLS (terrestrial laser scanning). 3D data provide a better understanding of the real-world environment through realistic visualization. However, data management issues arise when the data need to be organized in a database system. One of the issues is the volume of 3D geospatial data, which is large compared with 2D data because of the geometric detail attached to it and other information such as images and attributes. Thus, more disk space is needed to store 3D geospatial data. For example, 3D geospatial data produced for an urban area using laser scanning techniques can require up to 63 GB of disk space (Wand et al. 2007). For 3D urban datasets, the volume is usually large because of the high building density.
Massive 3D geospatial datasets are very complex to organize in a database system. Thus, a data model is used as a guideline to manage the data. Using the data model, geospatial data are transformed into a set of rows and records in the database. The dataset is then retrieved, processed, and analyzed to turn it into valuable information. However, because of the large volume of geospatial data, data retrieval performance easily deteriorates during query operations, since each row and record in the database has to be inspected. In some applications, data retrieval performance is very important. For example, in business service applications, retrieving customer information at a specific time is important for efficient delivery service, and for service-based businesses, punctuality is very important for company reputation. Fast data retrieval is also important for emergency response applications such as hospitals and fire stations, where every second counts.
Since time is critical for data retrieval, a specific technique is required to boost performance during query operations. In spatial databases, spatial access methods are used to support efficient spatial selection, especially for range queries, map overlay, spatial analysis, and spatial joins. Without spatial indexing, full table scans need to be performed in order to meet the spatial selection criterion. Therefore, spatial indexing is required to address objects efficiently without examining every row and record. In spatial databases, 2D spatial indexing is well established compared to its 3D counterpart. 2D spatial index structures are not the best fit for 3D geospatial data since the data types and the relationships between objects are defined differently than in 2D. A well-established index structure for 3D spatial information is still an open research problem. Thus, a dedicated index structure for 3D geospatial information is significant for efficient data retrieval.
Efforts to develop 3D spatial indexing can be seen in several studies; see Wang and Guo (2012), Gong et al. (2009), Zhu et al. (2007), Deren et al. (2004), and Zlatanova (2000). Based on those studies and reviews, most researchers agree that the transition from the 2D R-tree structure to a 3D R-tree would be a starting point toward a promising 3D spatial index structure. The R-tree index structure was introduced by Guttman (1984). It is a simple data structure that bounds objects with minimum bounding rectangles (MBRs). The structure of the 3D R-tree does not differ much from the original R-tree even after the change in dimensionality. However, when the R-tree is extended to the third dimension (3D R-tree), the minimum bounding volumes (MBVs) of the nodes frequently overlap. In certain cases, the MBV of a node can also be completely covered by another MBV. Overlapping nodes are the main reason for low query performance because they lead to multipath queries and replicated data entries.
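The overlap and coverage conditions can be checked with simple box arithmetic. The sketch below assumes axis-aligned MBVs stored as a pair of corner tuples (minimum corner, maximum corner); the function names `mbv`, `overlap_volume`, and `covers` are hypothetical.

```python
def mbv(points):
    """Axis-aligned minimum bounding volume as (min_corner, max_corner)."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def overlap_volume(a, b):
    """Volume shared by two MBVs; 0.0 when they are disjoint or only touch."""
    volume = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a[0], a[1], b[0], b[1]):
        side = min(hi_a, hi_b) - max(lo_a, lo_b)
        if side <= 0:
            return 0.0
        volume *= side
    return volume

def covers(a, b):
    """True when MBV a fully contains MBV b."""
    return all(
        lo_a <= lo_b and hi_b <= hi_a
        for lo_a, hi_a, lo_b, hi_b in zip(a[0], a[1], b[0], b[1])
    )

if __name__ == "__main__":
    node_a = mbv([(0, 0, 0), (2, 2, 2)])
    node_b = mbv([(1, 1, 1), (3, 3, 3)])
    print(overlap_volume(node_a, node_b))  # the two boxes share a unit cube
    print(covers(node_a, node_b))
```

A query box that intersects the shared region has to descend into both nodes, which is the multipath behavior described above.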
In several urban applications, such as real-time applications, geospatial data or urban objects are frequently updated. Thus, rows or records in the database are modified through updating operations such as insert, delete, and update. This process affects the index structure of the 3D R-tree. In certain cases, a node overflows with M + 1 entries or underflows with fewer than the minimum number of entries, m. In these cases, the node may need to be split using a splitting operation or merged with another node. The splitting operation is the most critical process for the R-tree index structure (Fu et al. 2002; Liu et al. 2009; Korotkov 2012; Sleit and Al-Nsour 2014). At this phase the tree structure is altered, and, at the same time, the split should produce minimal node overlap, minimal coverage area, and minimal tree height. These issues become critical in 3D. Minimizing the overlap of MBVs is more complex, and the splitting operation requires a different approach than in 2D.
Crisp clustering produces nonoverlapping partitions, so each object either belongs to a class or does not. This characteristic suits the aim of the R-tree structure (Guttman 1984), in which an object appears only once in an index node. The idea is to cluster 3D geospatial data into classes, where each class represents a node or MBV of the 3D R-tree. This approach differs from the original R-tree approach, and it is expected to produce a better 3D R-tree structure.
Among the crisp clustering techniques, k-means is the most widely used in various applications. However, finding the optimal k-means clustering is an NP (nondeterministic polynomial time)-hard problem, and the standard iterative approach depends on its initial centers, so it can end up with more than one cluster center in the same group. Having more than one cluster center in the same group can cause serious node overlap since the cluster centers are not evenly spread. The k-means++ algorithm (Arthur and Vassilvitskii 2007) addresses this by seeding the initial centers as follows.
 Input: a set of 3D vector data P = {p_{1}, p_{2}, …, p_{ n }} ⊂ ℝ^{ d }
 Step 1: Choose an initial center C_{1} uniformly at random from P
 Step 2: Choose a new center C_{ i } by choosing p ∈ P with probability
 $$\displaystyle{ \frac{D(p)^{2}} {\sum _{p\in P}D(p)^{2}} }$$(4)
 where D(p) is the distance from p to the nearest center already chosen
 Step 3: Repeat Step 2 until k centers C_{1}, …, C_{ k } are chosen
 Step 4: Proceed with the standard k-means approach
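Steps 1 to 3 can be sketched as follows. The example assumes Euclidean distance, takes D(p) as the distance from p to the nearest center chosen so far, and requires at least k distinct points in P; the function name `kmeanspp_seeds` and the `seed` parameter are hypothetical.

```python
import math
import random

def kmeanspp_seeds(points, k, seed=0):
    """Choose k initial centers with D(p)^2-weighted sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # Step 1: first center chosen at random
    while len(centers) < k:         # Step 3: repeat until k centers exist
        # D(p)^2: squared distance from p to its nearest chosen center.
        weights = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        # Step 2: draw p with probability D(p)^2 / sum of D(p)^2, as in Eq. (4).
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

if __name__ == "__main__":
    data = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10), (10, 11, 10)]
    print(kmeanspp_seeds(data, 2))
```

Points that are already centers have weight zero and cannot be drawn again, while points far from every chosen center are the most likely candidates. Step 4 would pass the returned centers to a standard k-means iteration.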
The k-means++ crisp clustering algorithm has been shown to produce a better 3D R-tree than the k-means approach. In this paper, we adopt this clustering approach for the construction of the 3D R-tree as well as for the splitting operation of an overflown node N with M + 1 entries.
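One way such a split could look is sketched below. It illustrates the idea only and is not the authors' implementation: each entry is represented by a single 3D point (for instance, the centroid of its bounding volume), the node is split into exactly two groups, seeding follows the D(p)^2 rule, and the node is assumed to hold at least two distinct entries. The names `mbv` and `split_node` are hypothetical.

```python
import math
import random

def mbv(points):
    """Axis-aligned minimum bounding volume as (min_corner, max_corner)."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def split_node(entries, seed=0, max_iter=50):
    """Split the entry points of an overflown node into two groups with MBVs."""
    rng = random.Random(seed)
    # k-means++ seeding for two centers.
    first = rng.choice(entries)
    weights = [math.dist(p, first) ** 2 for p in entries]
    centers = [first, rng.choices(entries, weights=weights, k=1)[0]]
    groups = [[], []]
    for _ in range(max_iter):
        # Assign every entry to the nearer of the two centers.
        groups = [[], []]
        for p in entries:
            nearer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            groups[nearer].append(p)
        # Recompute the centers; an empty group keeps its previous center.
        new_centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Each non-empty group becomes a new node bounded by its own MBV.
    return [(g, mbv(g)) for g in groups if g]

if __name__ == "__main__":
    overflown = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10), (10, 11, 10)]
    for group, box in split_node(overflown):
        print(box, group)
```

Because every entry is assigned to exactly one group, no entry is duplicated across the two new nodes. The MBVs of the groups can still overlap when the entries are not well separated.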
Key Applications
Collision Detection
Collision detection is important in many computer graphics and visualization applications. A classical hierarchical traversal scheme is usually used for collision detection. However, problems arise with this approach, such as visiting a node more than once and transforming nodes into a local coordinate system (Figueiredo et al. 2010). As a consequence, the performance of the collision detection process deteriorates. With a 3D R-tree based on the crisp clustering approach, collisions could be detected efficiently without visiting nodes repeatedly.
Real-Time Application
Real-time applications such as in-vehicle satellite navigation or web-based systems are subject to active data updating, such as updated coordinates and the number of online users. To retrieve a set of data within a specific time, a performance booster such as 3D R-tree spatial indexing could be used. A frequent data updating process needs an efficient index structure with minimal overlap.
Point Cloud Data Management
Dealing with millions of points collected from airborne sensors or terrestrial laser scanners often creates problems in data management and visualization. In this case, spatial indexing is used to retrieve points efficiently from a massive dataset. One of the best-known spatial indexing techniques used for this application is the R-tree index structure. However, the R-tree suffers from serious overlap among nodes, which can cause multipath queries and deteriorate the performance of data retrieval. Using the crisp clustering algorithm could reduce the risk of multipath queries and increase the efficiency of search and query operations on a massive point cloud collection.
Future Directions
Based on our observation, the 3D R-tree has its own limitations during data updating. Whenever an update occurs, such as an insert or delete operation, the tree structure needs to be revised, and all nodes, including the root node, need to be modified. This cost may be significant for applications with frequent updates or moving objects. A special technique for handling data updates in the R-tree without revising its structure could reduce the processing time and increase performance efficiency, and it would therefore be a very interesting topic for future work.
Cross-References
References
 Ang CH, Tan TC (1997) New linear node splitting algorithm for R-trees. In: Scholl M, Voisard A (eds) Advances in spatial databases, vol 1262. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 337–349. doi:10.1007/3-540-63238-7_38
 Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, New Orleans. Society for Industrial and Applied Mathematics, pp 1027–1035
 Deren L, Qing Z, Qiang L, Peng X (2004) From 2D and 3D GIS for CyberCity. Geo-Spat Inf Sci 7(1):1–5. doi:10.1007/bf02826668
 Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. Paper presented at the proceeding of 2nd international conference on knowledge discovery and data mining, Portland
 Figueiredo M, Oliveira J, Araújo B, Pereira J (2010) An efficient collision detection algorithm for point cloud models. In: 20th international conference on computer graphics and vision, Warsaw. Citeseer, p 44
 Fu Y, Teng JC, Subramanya S (2002) Node splitting algorithms in tree-structured high-dimensional indexes for similarity search. In: Proceedings of the 2002 ACM symposium on applied computing, Madrid. ACM, pp 766–770
 Gong J, Ke S, Li X, Qi S (2009) A hybrid 3D spatial access method based on quadtrees and R-trees for globe data. 74920R–74920R. doi:10.1117/12.837594
 Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. SIGMOD Rec 27(2):73–84. doi:10.1145/276305.276312
 Guttman A (1984) R-trees: a dynamic index structure for spatial searching. SIGMOD Rec 14(2):47–57. doi:10.1145/971697.602266
 Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. Paper presented at the proceedings of the 4th ACM SIGKDD, New York
 Korotkov A (2012) A new double sorting-based node splitting algorithm for R-tree. Programm Comput Softw 38(3):109–118
 Kovács F, Legány C, Babos A (2005) Cluster validity measurement techniques. In: Proceeding of sixth international symposium Hungarian researchers on computational intelligence (CINTI), Barcelona. Citeseer
 Liu Y, Fang J, Han C (2009) A new R-tree node splitting algorithm using MBR partition policy. In: 2009 17th international conference on geoinformatics, Fairfax. IEEE, pp 1–6
 MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Berkeley, p 14
 Ng RT, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 20th VLDB conference, Santiago
 Sheikholeslami G, Chatterjee S, Zhang A (2000) WaveCluster: a wavelet-based clustering approach for spatial data in very large databases. VLDB J 8(3–4):289–304. doi:10.1007/s007780050009
 Sleit A, Al-Nsour E (2014) Corner-based splitting: an improved node splitting algorithm for R-tree. J Inf Sci. doi:10.1177/0165551513516709
 Theodoridis S, Koutroumbas K (2009) Chapter 13 – clustering algorithms II: hierarchical algorithms. In: Theodoridis S, Koutroumbas K (eds) Pattern recognition, 4th edn. Academic, Boston, pp 653–700. doi:http://dx.doi.org/10.1016/B978-1-59749-272-0.50015-3
 Wand M, Berner A, Bokeloh M, Fleck A, Hoffmann M, Jenke P, Maier B, Staneker D, Schilling A (2007) Interactive editing of large point clouds. In: SPBG, Prague, pp 37–45
 Wang W, Yang J, Muntz RR (1997) STING: a statistical information grid approach to spatial data mining. In: Paper presented at the proceedings of the 23rd international conference on very large data bases, Athens
 Wang Y, Guo M (2012) An integrated spatial indexing of huge point image model. In: Paper presented at the international archives of the photogrammetry, remote sensing and spatial information sciences, Melbourne, 25 Aug–01 Sept 2012
 Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. SIGMOD Rec 25(2):103–114. doi:10.1145/235968.233324
 Zhu Q, Gong J, Zhang Y (2007) An efficient 3D R-tree spatial index method for virtual geographic environments. ISPRS J Photogramm Remote Sens 62(3):217–224. doi:http://dx.doi.org/10.1016/j.isprsjprs.2007.05.007
 Zlatanova S (2000) 3D GIS for urban development. International Institute for Aerospace Survey and Earth Sciences (ITC)