Encyclopedia of GIS

2017 Edition
| Editors: Shashi Shekhar, Hui Xiong, Xun Zhou

3D Crisp Clustering of Geo-Urban Data

  • Suhaibah Azri
  • Alias Abdul Rahman
  • Uznir Ujang
  • François Anton
  • Darka Mioc
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-17885-1_1610

Synonyms

Definition

Crisp clustering is a technique that groups objects into non-overlapping partitions: each data point either belongs to a group or does not. Most clustering algorithms are crisp. Crisp clustering algorithms fall into several categories, namely partitional, hierarchical, density-based, and grid-based algorithms. Each category can be defined as follows (Kovács et al. 2005):
  • Partitional algorithms: These algorithms divide the data into a set of disjoint partitions. They attempt to determine a number of partitions that optimizes a certain criterion function. The optimization is an iterative procedure.

  • Hierarchical algorithms: These algorithms create clusters repeatedly, either by merging smaller clusters into larger ones or by splitting a cluster into several smaller ones.

  • Density-based algorithms: Clusters are generated from a density function, which allows clusters of arbitrary shape to be produced.

  • Grid-based algorithms: These algorithms are widely used in spatial data mining. The search space is quantized into a finite number of cells.

Historical Background

Crisp clustering algorithms are used in many fields, such as web mining, spatial data analysis, business, and group-based prediction. Over the years, a number of algorithms have been proposed for various applications. They are presented below by category.

Partitional Algorithms

  • k-means

  • k-means is the most widely used crisp clustering algorithm in applications such as machine learning, statistical analysis, and computer visualization. It was introduced by MacQueen (1967) for the problem of data clustering. The aim of this clustering technique is to minimize the following objective function:
    $$\displaystyle{ E =\sum _{i=1}^{c}\sum _{x\in C_{i}}d(x,m_{i}) }$$
    (1)
    In Eq. (1), m_i is the center of cluster C_i, and d(x, m_i) is the distance from point x to m_i. Minimizing E therefore minimizes the distance between each point and its cluster center. A set of c cluster centers is chosen at the initial step. Each object is then assigned to the nearest cluster center, the centers are recomputed, and the process repeats until the cluster centers stop changing.
  • PAM (Partitioning Around Medoid)

  • This algorithm attempts to find a medoid for each cluster. A medoid is the representative object of a cluster, that is, the object whose average dissimilarity to the other objects in the cluster is minimal. The algorithm searches for objects that are centrally located in the clusters: PAM first computes k representative objects (medoids). After the medoids are found, each object is grouped with its nearest medoid, so that object i is placed in cluster P_i when the medoid m_{P_i} is nearer than any other medoid:
    $$\displaystyle{ d(i,m_{P_{i}})\leq d(i,m_{j})\ \text{for all}\ j = 1,\ldots, k }$$
    (2)
    The k medoids are expected to minimize the objective function of PAM, which sums the dissimilarity between each object and its medoid:
    $$\displaystyle{ \sum _{i}d(i,m_{P_{i}}) }$$
    (3)
    According to Ng and Han (1994), PAM is an expensive way of finding medoids, because it repeatedly exchanges medoids with other objects and re-evaluates the objective function until no exchange improves the result.
  • CLARA (Clustering Large Applications)

  • CLARA uses PAM as part of its technique. It draws multiple samples from the dataset and applies PAM to each sample.

  • CLARANS (Clustering Large Applications based on Randomized Search)

  • CLARANS combines PAM with a randomized search. It searches a graph in which each node is a potential solution, that is, a set of k medoids. Replacing one medoid of a node yields a neighboring node, and the clusters it produces form a clustering that neighbors the existing one. Starting from a node, CLARANS compares randomly selected neighbors, up to a user-defined number. When a better candidate is found, CLARANS moves to that neighbor and restarts the process. If none is found, the current node is declared a local optimum, and a new node is selected at random to search for another local optimum.
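
The k-means iteration behind Eq. (1) and the medoid search behind Eqs. (2) and (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: d is taken to be the Euclidean distance, the function names (`nearest`, `kmeans`, `medoid_cost`, `pam`) are hypothetical, the PAM variant starts from the first k objects and accepts any improving swap, and neither function is tuned for large datasets.

```python
import math
import random
from itertools import product

def nearest(p, centers):
    """Index of the center closest to point p (Euclidean distance, an assumption)."""
    return min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))

def kmeans(points, c, iters=100, seed=0):
    """Lloyd-style k-means; returns (centers, labels, E) with E as in Eq. (1)."""
    rng = random.Random(seed)
    centers = rng.sample(points, c)              # initial step: c input points
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for i in range(c):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Recompute the center as the mean; keep it if the cluster is empty.
            updated.append(tuple(sum(v) / len(members) for v in zip(*members))
                           if members else centers[i])
        if updated == centers:                   # centers stopped changing
            break
        centers = updated
    labels = [nearest(p, centers) for p in points]
    E = sum(math.dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, E

def medoid_cost(points, medoids):
    """Eq. (3): each object contributes the distance to its nearest medoid (Eq. 2)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Simplified PAM: swap a medoid for a non-medoid while the cost drops."""
    medoids = list(range(k))                     # assumed start: first k objects
    best = medoid_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for pos, cand in product(range(k), range(len(points))):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            cost = medoid_cost(points, trial)
            if cost < best:
                medoids, best, improved = trial, cost, True
    return medoids, best
```

Because every candidate swap re-evaluates the full objective, the cost of `pam` grows quickly with the number of objects, which is the expense that CLARA and CLARANS try to avoid by sampling and randomized search.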

Hierarchical Algorithms

Hierarchical algorithms produce a hierarchy of clusterings. They are applied in various fields, such as modern biology, biological taxonomy, computer science, and engineering. According to Theodoridis and Koutroumbas (2009), hierarchical algorithms can be divided into two categories:
  • Agglomerative algorithms: These algorithms produce a decreasing number of clusters at each step. The two nearest clusters are merged at each step, producing a sequence of clustering schemes.

  • Divisive algorithms: In contrast to agglomerative algorithms, these algorithms produce an increasing number of clusters at each step. A cluster is split into two at each step, producing a sequence of clustering schemes.
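
The agglomerative scheme can be illustrated with a short Python sketch. It assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance, neither of which is prescribed by the text; the function name `agglomerative` and its `target` parameter are hypothetical.

```python
import math

def agglomerative(points, target):
    """Merge the two nearest clusters (single linkage) until `target` remain."""
    clusters = [[p] for p in points]          # start: every object is its own cluster
    history = [len(clusters)]                 # number of clusters after each step
    while len(clusters) > target:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))   # b > a, so index a stays valid
        history.append(len(clusters))
    return clusters, history
```

The `history` list makes the decreasing number of clusters explicit: it drops by one at every merge. A divisive algorithm would run in the opposite direction, starting from one cluster and splitting at each step.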

Some examples of hierarchical algorithms are described below:
  • BIRCH

  • BIRCH (Zhang et al. 1996) uses a CF-tree as a hierarchical structure to partition a point dataset. Its authors also describe it as the first clustering algorithm in the database area to handle noise effectively.

  • CURE

  • CURE (Guha et al. 1998) selects a set of points from the data and then shrinks them toward the cluster center. To handle large-volume applications such as large databases, CURE combines random sampling with partitioning.

Density-Based Algorithms

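This type of algorithm considers a cluster to be a region in the n-dimensional space. Most of these algorithms do not impose any restriction on the shape of the resulting clusters, and they are able to handle outliers. The time complexity is O(N^2) in the worst case; with a spatial index to answer the neighborhood queries, DBSCAN, for example, runs in O(N log N) on average (Ester et al. 1996), which makes it suitable for large datasets.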
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

  • In the DBSCAN algorithm (Ester et al. 1996), each point in a cluster must have at least a minimum number of points within a given radius. The algorithm handles noise and outliers effectively. DBSCAN is also used as the basic algorithm for incremental clustering, which supports efficient insertion of objects into, and deletion of objects from, an existing clustering.

  • DENCLUE (Density-Based Clustering)

  • DENCLUE was proposed by Hinneburg and Keim (1998) for clustering large databases, such as multimedia databases. The algorithm models the point density analytically. Clusters are then identified by determining the density attractors.
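
The density criterion of DBSCAN (a minimum number of points within a given radius) can be sketched as follows. The parameter names `eps` and `min_pts` are assumptions, the neighborhood search is a brute-force scan that takes O(N^2) time overall, and a point counts as its own neighbor; a practical implementation would answer the neighborhood queries with a spatial index.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch; returns one label per point, -1 marks noise."""
    def neighbors(i):
        # Brute-force radius query: every point within eps of point i.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)      # None = not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:       # not dense enough: tentatively noise
            labels[i] = -1
            continue
        cluster += 1                   # point i starts a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:        # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:   # j is dense too: keep expanding
                queue.extend(reachable)
    return labels
```

Points that never fall within the radius of a dense point keep the label -1, which is how noise and outliers are separated from the clusters.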

Grid-Based Algorithms

Grid-based algorithms quantize a space or region into a finite number of cells. This type of clustering is increasingly used in spatial applications:
  • STING (Statistical Information Grid-based method)

  • STING was proposed by Wang et al. (1997). It divides the space or region into rectangular cells organized in a hierarchical structure. Statistical parameters (e.g., min, max, and mean) summarize the numerical features of the objects in each cell. The clustering information is then represented through the hierarchical structure of the grid cells. This approach makes search queries efficient.

  • WaveCluster

  • WaveCluster was proposed by Sheikholeslami et al. (2000). The algorithm originates from signal processing and the frequency domain. It starts by imposing a multidimensional grid structure onto the space. The information in each grid cell is then transformed using a wavelet transform. Clusters are found by identifying dense regions in the transformed domain.
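
The quantization step shared by grid-based methods can be illustrated with a small Python sketch. It covers a single grid level only, not the full hierarchy of STING, and it makes several assumptions: cells are cubes of side `cell_size`, the summarized attribute is the z coordinate, and the names `grid_cells` and `dense_cells` are hypothetical.

```python
from collections import defaultdict

def grid_cells(points, cell_size):
    """Quantize 3D points into cubic cells and keep per-cell statistics."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(c // cell_size) for c in p)   # integer cell index per axis
        cells[key].append(p)
    stats = {}
    for key, members in cells.items():
        zs = [p[2] for p in members]                  # example attribute: height
        stats[key] = {"count": len(members), "min_z": min(zs),
                      "max_z": max(zs), "mean_z": sum(zs) / len(zs)}
    return stats

def dense_cells(stats, min_count):
    """Cells holding at least `min_count` points, as cluster candidates."""
    return sorted(k for k, s in stats.items() if s["count"] >= min_count)
```

Once the statistics are stored per cell, a query only has to look at the cells rather than at the individual objects, which is where the efficiency of grid-based approaches comes from.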

Scientific Fundamentals

3D geospatial data are expected to become the core of spatial data in the near future. This is due to the increasing demand for 3D geospatial applications and to state-of-the-art 3D data capture techniques such as LiDAR (light detection and ranging), UAV (unmanned aerial vehicle) surveys, and TLS (terrestrial laser scanning). Because of their realistic visualization, 3D data provide a better understanding of the real-world environment. However, data management issues arise when the data need to be organized in a database system. One issue is the volume of 3D geospatial data, which is large compared with 2D data because of the geometric detail and the other information attached to it, such as images and attributes. More disk space is therefore needed to store 3D geospatial data. For example, 3D geospatial data produced for an urban area using laser scanning can require up to 63 GB of disk space (Wand et al. 2007). For 3D urban datasets, the volume is usually large because of the high building density.

A massive 3D geospatial dataset is complex to organize in a database system. A data model is therefore used as a guideline for managing the data. Using a data model, geospatial data are transformed into a set of rows and records in the database. The dataset is then retrieved, processed, and analyzed to turn it into valuable information. However, because of the large volume of geospatial data, retrieval performance deteriorates easily during query operations, since each row and record in the database has to be inspected. In some applications, retrieval performance is very important. For example, in a business service application, retrieving customer information at a specific time is important for efficient delivery, and punctuality matters for the reputation of a service-based company. Fast data retrieval is also important for emergency response applications, such as those of hospitals and fire stations, where every second counts.

Since time is critical for data retrieval, a specific technique is required to boost performance during query operations. In spatial databases, spatial access methods are used to support efficient spatial selection, especially for range queries, map overlay, spatial analysis, and spatial joins. Without a spatial index, a full table scan has to be performed to evaluate a spatial selection criterion. Spatial indexing is therefore required to address objects efficiently without examining every row and record. In spatial databases, 2D spatial indexing is well established compared with its 3D counterpart. 2D spatial index structures are not the best fit for 3D geospatial data, since the data types and the relationships between objects are defined differently than in 2D. A well-established index structure for 3D spatial information is still an open research problem. A dedicated index structure for 3D geospatial information is thus important for efficient data retrieval.

Efforts to develop 3D spatial indexing can be seen in several studies; see Wang and Guo (2012), Gong et al. (2009), Zhu et al. (2007), Deren et al. (2004), and Zlatanova (2000). Based on these studies and reviews, most researchers agree that extending the 2D R-tree structure to a 3D R-tree is a starting point toward a promising 3D spatial index structure. The R-tree index structure was introduced by Guttman (1984). It is a simple data structure that bounds objects with minimum bounding rectangles (MBRs). The structure of the 3D R-tree does not differ much from that of the original R-tree, despite the change in dimensionality. However, when the R-tree is extended to the third dimension (3D R-tree), the minimum bounding volumes (MBVs) of the nodes frequently overlap. In some cases, the MBV of one node is even covered completely by another MBV. Overlapping nodes are the main reason for poor query performance, because they lead to multipath queries and replicated data entries.
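
The notions of MBV, overlap, and coverage used here can be made concrete with a short Python sketch. It assumes axis-aligned boxes stored as a pair of corner points, which is the usual R-tree convention but is not spelled out in the text; the function names `mbv`, `overlap_volume`, and `covers` are hypothetical.

```python
def mbv(points):
    """Axis-aligned minimum bounding volume as (min_corner, max_corner)."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def overlap_volume(a, b):
    """Volume shared by two MBVs; 0.0 when they are disjoint or only touch."""
    (alo, ahi), (blo, bhi) = a, b
    vol = 1.0
    for axis in range(3):
        side = min(ahi[axis], bhi[axis]) - max(alo[axis], blo[axis])
        if side <= 0:                  # no extent in common along this axis
            return 0.0
        vol *= side
    return vol

def covers(a, b):
    """True when MBV a completely contains MBV b."""
    (alo, ahi), (blo, bhi) = a, b
    return all(alo[i] <= blo[i] and bhi[i] <= ahi[i] for i in range(3))
```

A query box that intersects the shared volume of two sibling nodes has to descend into both of them, which is the multipath behavior described above.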

In several urban applications, such as real-time applications, geospatial data or urban objects are updated frequently. Rows or records in the database are then modified through insert, delete, and update operations. These operations affect the structure of the 3D R-tree index. In certain cases, a node overflows with M + 1 entries, where M is the maximum number of entries per node, or underflows with n < m entries, where m is the minimum number of entries. In these cases, the node needs to be split using a splitting operation or merged with another node. The splitting operation is the most critical process for the R-tree index structure (Fu et al. 2002; Liu et al. 2009; Korotkov 2012; Sleit and Al-Nsour 2014). During this phase the tree structure is altered, and the split should at the same time produce minimal node overlap, minimal coverage area, and minimal tree height. These issues become critical in 3D: minimizing the overlap and coverage of MBVs is more complex, and the splitting operation requires a different approach than in 2D.

Crisp clustering produces non-overlapping partitions: each object either belongs to a class or does not. This characteristic matches the aim of the R-tree structure (Guttman 1984), in which an object appears only once in an index node. The idea is to cluster 3D geospatial data into classes, where each class represents a node, or MBV, of the 3D R-tree. This approach differs from the original R-tree approach and is expected to produce a better 3D R-tree structure.

Among the crisp clustering techniques, k-means is the most widely used in various applications. However, finding the optimal k-means partition is an NP-hard problem (NP: non-deterministic polynomial time), so the algorithm relies on a heuristic whose result depends on the initial centers. As a consequence, more than one cluster center can end up in the same group. This can cause serious node overlap, since the cluster centers are not evenly spread.

To overcome this issue, we propose using an improved k-means crisp clustering algorithm, k-means++. Arthur and Vassilvitskii (2007) introduced careful seeding to improve the k-means algorithm. With this approach, the initial seeds are chosen first, and the remaining objects are then clustered according to their distance to the nearest seed. The algorithm has been shown to improve on the accuracy of the original algorithm, and its cluster centers are more evenly spread than those of k-means. In this work, the k-means++ algorithm is extended to 3D for urban data. The algorithm can be described as follows:
Input:

a set of 3D vector data P = {p1, p2, …, pn} ⊂ ℝ^d

Step 1:

Choose an initial center C1 from P

Step 2:
Choose a new center Ci by choosing p ∈ P with probability
$$\displaystyle{ \frac{D(p)^{2}} {\sum _{p'\in P}D(p')^{2}} }$$
(4)
where D(p) is the distance from p to the nearest center already chosen

Step 3:

Repeat Step 2 until k centers C1, …, Ck have been chosen

Step 4:

Proceed with the standard k-means approach
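
Steps 1 to 3 can be sketched in Python as follows. The sketch assumes Euclidean distance and, following Arthur and Vassilvitskii (2007), draws the first center uniformly at random; the function name `kmeanspp_seeds` and its `seed` parameter are hypothetical, and the sketch is not the implementation used in the experiments below.

```python
import math
import random

def kmeanspp_seeds(points, k, seed=0):
    """Pick k initial centers from 3D points by D^2-weighted sampling (Eq. 4).

    Assumes the data contain at least k distinct points; otherwise all
    weights become zero and random.choices raises an error.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]                   # Step 1: first center C1
    while len(centers) < k:                          # Step 3: repeat Step 2
        # D(p)^2: squared distance from p to the nearest center chosen so far.
        d2 = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        # Step 2: draw p with probability D(p)^2 / sum of D(p')^2 over P.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers
```

Points that are already centers have D(p) = 0 and are therefore not drawn again, while points far from every chosen center are the most likely to be picked. This is what spreads the seeds before Step 4 hands them to standard k-means.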

The proposed k-means++ crisp clustering approach was found to produce a better 3D R-tree than the k-means approach. In this work, we adopt this clustering approach both for the construction of the 3D R-tree and for the splitting of an overflowing node N with M + 1 entries.

The workflow of the 3D R-tree based on the proposed crisp clustering approach is illustrated in Fig. 1. The workflow was tested on a set of 200 3D objects (n = 200), which were clustered as shown in Fig. 2. The maximum number of entries M for each node was set to 25, which means that at most 25 objects are allowed in each MBV. As a result, the objects were grouped into eight classes: P, Q, R, S, T, U, V, and W. However, three of these MBVs (R, V, and W) exceeded M entries and therefore qualified for the next cycle. In the second cycle, each of these nodes is split into two subgroups: R (Sub-R1 and Sub-R2), V (Sub-V1 and Sub-V2), and W (Sub-W1 and Sub-W2).
3D Crisp Clustering of Geo-Urban Data, Fig. 1

3D R-tree workflow

3D Crisp Clustering of Geo-Urban Data, Fig. 2

Clustered objects using crisp clustering
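
The split cycle described above, in which every group with more than M entries is divided into two subgroups, can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes that the cycle simply repeats until no group exceeds M, uses a two-center k-means seeded in the k-means++ manner for each split, and introduces the hypothetical names `two_way_split` and `split_overflowing`.

```python
import math
import random

def two_way_split(points, rng, iters=50):
    """Split one overflowing group in two: D^2 seeding followed by 2-means."""
    half = len(points) // 2
    first = rng.choice(points)
    d2 = [math.dist(p, first) ** 2 for p in points]
    if sum(d2) == 0:                       # all points coincide: just halve the list
        return points[:half], points[half:]
    centers = [first, rng.choices(points, weights=d2, k=1)[0]]
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            nearer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]: # defensive fallback, not expected in practice
            return points[:half], points[half:]
        updated = [tuple(sum(v) / len(g) for v in zip(*g)) for g in groups]
        if updated == centers:             # centers stopped changing
            break
        centers = updated
    return groups

def split_overflowing(groups, M, seed=0):
    """Repeat the split cycle until every group holds at most M entries."""
    rng = random.Random(seed)
    pending, done = list(groups), []
    while pending:
        g = pending.pop()
        if len(g) <= M:
            done.append(g)                 # fits in one node (one MBV)
        else:
            pending.extend(two_way_split(g, rng))
    return done
```

With M = 25, a call such as `split_overflowing([R, V, W], 25)` on the three overflowing groups would return subgroups of at most 25 objects each, every one of which can then be wrapped in its own MBV.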

To evaluate how efficiently the proposed approach constructs a 3D R-tree, a set of 3D vector data was tested. The datasets consist of 3D volumetric objects (i.e., 3D buildings). In the first experiment, a set of 500 buildings in an urban area, as represented in Fig. 3, was clustered using the proposed approach. The input 3D buildings are in CityGML format at LoD 2 (level of detail 2). The number of cluster classes for this dataset was set to 20, with the maximum number of entries M set to 25 for each class. The classes were then formed into the MBVs of the 3D R-tree. The result of this experiment was compared with the original R-tree and with the original k-means crisp clustering. Figure 3 shows the comparison of the overlap percentage and coverage percentage of the proposed approach with the other approaches.
3D Crisp Clustering of Geo-Urban Data, Fig. 3

Comparison of overlap percentage and coverage percentage

The same dataset as in Fig. 3 was tested for the node splitting operation. As mentioned above, the splitting operation of a 3D R-tree should preserve minimal overlap among nodes, minimal coverage area, and minimal tree height. In this test, three splitting approaches were compared: new linear (Ang and Tan 1997), exhaustive R-tree (Guttman 1984), and crisp clustering. As Fig. 4 shows, crisp clustering yields the lowest percentage of total overlap between nodes, 20%, compared with 97% for the original exhaustive R-tree and 88% for new linear.
3D Crisp Clustering of Geo-Urban Data, Fig. 4

Percentage of overlap using different approaches

Key Applications

Collision Detection

Collision detection is important in many computer graphics and visualization applications. A classical hierarchical traversal scheme is usually used for collision detection. However, this approach has problems, such as visiting a node more than once and transforming nodes into a local coordinate system (Figueiredo et al. 2010). As a consequence, the performance of collision detection deteriorates. With a 3D R-tree based on the crisp clustering approach, collisions could be detected efficiently without visiting nodes repeatedly.

Real-Time Application

Real-time applications, such as in-vehicle satellite navigation or web-based systems, are subject to continuous data updates, for example, updated coordinates or the number of online users. To retrieve a set of data within a specific time, a performance booster such as 3D R-tree spatial indexing could be used. Frequent data updates require an efficient index structure with minimal overlap.

Point Cloud Data Management

Dealing with millions of points collected by airborne sensors or terrestrial laser scanners often creates problems in data management and visualization. In this case, spatial indexing is used to retrieve points efficiently from a massive dataset. One well-known spatial indexing technique used for this application is the R-tree index structure. However, the R-tree suffers from serious overlap among nodes, which can cause multipath queries and degrade retrieval performance. Using the crisp clustering algorithm could reduce the risk of multipath queries and increase the efficiency of search and query operations on a massive point cloud collection.

Future Directions

Based on our observation, the 3D R-tree has a limitation during data updating. Whenever an update occurs, such as an insert or delete operation, the tree structure needs to be revised, and all nodes, including the root node, need to be modified. This cost may be significant for applications with frequent updates or moving objects. A special technique for handling data updates in an R-tree without revising its structure could reduce the processing time and increase performance. Such a technique would be a very interesting topic for future work.

Cross-References

References

  1. Ang CH, Tan TC (1997) New linear node splitting algorithm for R-trees. In: Scholl M, Voisard A (eds) Advances in spatial databases, vol 1262. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 337–349. doi:10.1007/3-540-63238-7_38
  2. Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, New Orleans. Society for Industrial and Applied Mathematics, pp 1027–1035
  3. Deren L, Qing Z, Qiang L, Peng X (2004) From 2D and 3D GIS for CyberCity. Geo-Spat Inf Sci 7(1):1–5. doi:10.1007/bf02826668
  4. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. Paper presented at the proceeding of 2nd international conference on knowledge discovery and data mining, Portland
  5. Figueiredo M, Oliveira J, Araújo B, Pereira J (2010) An efficient collision detection algorithm for point cloud models. In: 20th international conference on computer graphics and vision, Warsaw. Citeseer, p 44
  6. Fu Y, Teng J-C, Subramanya S (2002) Node splitting algorithms in tree-structured high-dimensional indexes for similarity search. In: Proceedings of the 2002 ACM symposium on applied computing, Madrid. ACM, pp 766–770
  7. Gong J, Ke S, Li X, Qi S (2009) A hybrid 3D spatial access method based on quadtrees and R-trees for globe data. 74920R–74920R. doi:10.1117/12.837594
  8. Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. SIGMOD Rec 27(2):73–84. doi:10.1145/276305.276312
  9. Guttman A (1984) R-trees: a dynamic index structure for spatial searching. SIGMOD Rec 14(2):47–57. doi:10.1145/971697.602266
  10. Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. Paper presented at the proceedings of the 4th ACM SIGKDD, New York
  11. Korotkov A (2012) A new double sorting-based node splitting algorithm for R-tree. Programm Comput Softw 38(3):109–118
  12. Kovács F, Legány C, Babos A (2005) Cluster validity measurement techniques. In: Proceeding of sixth international symposium Hungarian researchers on computational intelligence (CINTI), Barcelona. Citeseer
  13. Liu Y, Fang J, Han C (2009) A new R-tree node splitting algorithm using MBR partition policy. In: 2009 17th international conference on geoinformatics, Fairfax. IEEE, pp 1–6
  14. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Berkeley, p 14
  15. Ng RT, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Proceedings of the 20th VLDB conference, Santiago
  16. Sheikholeslami G, Chatterjee S, Zhang A (2000) WaveCluster: a wavelet-based clustering approach for spatial data in very large databases. VLDB J 8(3–4):289–304. doi:10.1007/s007780050009
  17. Sleit A, Al-Nsour E (2014) Corner-based splitting: an improved node splitting algorithm for R-tree. J Inf Sci. doi:10.1177/0165551513516709
  18. Theodoridis S, Koutroumbas K (2009) Chapter 13 – clustering algorithms II: hierarchical algorithms. In: Theodoridis S, Koutroumbas K (eds) Pattern recognition, 4th edn. Academic, Boston, pp 653–700. doi:10.1016/B978-1-59749-272-0.50015-3
  19. Wand M, Berner A, Bokeloh M, Fleck A, Hoffmann M, Jenke P, Maier B, Staneker D, Schilling A (2007) Interactive editing of large point clouds. In: SPBG, Prague, pp 37–45
  20. Wang W, Yang J, Muntz RR (1997) STING: a statistical information grid approach to spatial data mining. In: Paper presented at the proceedings of the 23rd international conference on very large data bases, Athens
  21. Wang Y, Guo M (2012) An integrated spatial indexing of huge point image model. In: Paper presented at the international archives of the photogrammetry, remote sensing and spatial information Sciences, Melbourne, 25 Aug–01 Sept 2012
  22. Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. SIGMOD Rec 25(2):103–114. doi:10.1145/235968.233324
  23. Zhu Q, Gong J, Zhang Y (2007) An efficient 3D R-tree spatial index method for virtual geographic environments. ISPRS J Photogramm Remote Sens 62(3):217–224. doi:10.1016/j.isprsjprs.2007.05.007
  24. Zlatanova S (2000) 3D GIS for urban development. International Institute for Aerospace Survey and Earth Sciences (ITC)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Suhaibah Azri (1)
  • Alias Abdul Rahman (1)
  • Uznir Ujang (1)
  • François Anton (2)
  • Darka Mioc (3)
  1. Department of Geoinformation 3D GIS Research Lab, Universiti Teknologi Malaysia, Johor Bahru, Malaysia
  2. Department of Geodesy, Technical University of Denmark, Lyngby, Denmark
  3. Department of Geodesy, Technical University of Denmark, Lyngby, Denmark