An incremental algorithm for clustering spatial data streams: exploring temporal locality

Wei, Ling-Yin; Peng, Wen-Chih

doi:10.1007/s10115-013-0636-8

An incremental algorithm for clustering spatial data streams: exploring temporal locality

Regular Paper
Published: 26 April 2013

Volume 37, pages 453–483, (2013)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Ling-Yin Wei^1,2 &
Wen-Chih Peng¹

609 Accesses
10 Citations
Explore all metrics

Abstract

Clustering sensor data discovers useful information hidden in sensor networks. In sensor networks, a sensor has two types of attributes: a geographic attribute (i.e, its spatial location) and non-geographic attributes (e.g., sensed readings). Sensor data are periodically collected and viewed as spatial data streams, where a spatial data stream consists of a sequence of data points exhibiting attributes in both the geographic and non-geographic domains. Previous studies have developed a dual clustering problem for spatial data by considering similarity-connected relationships in both geographic and non-geographic domains. However, the clustering processes in stream environments are time-sensitive because of frequently updated sensor data. For sensor data, the readings from one sensor are similar for a period, and the readings refer to temporal locality features. Using the temporal locality features of the sensor data, this study proposes an incremental clustering (IC) algorithm to discover clusters efficiently. The IC algorithm comprises two phases: cluster prediction and cluster refinement. The first phase estimates the probability of two sensors belonging to a cluster from the previous clustering results. According to the estimation, a coarse clustering result is derived. The cluster refinement phase then refines the coarse result. This study evaluates the performance of the IC algorithm using synthetic and real datasets. Experimental results show that the IC algorithm outperforms exiting approaches confirming the scalability of the IC algorithm. In addition, the effect of temporal locality features on the IC algorithm is analyzed and thoroughly examined in the experiments.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Comprehensive Survey of Clustering Algorithms

Article 01 June 2015

K-DBSCAN: An improved DBSCAN algorithm for big data

Article 26 November 2020

Silhouette Index as Clustering Evaluation Tool

References

Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases (VLDB), pp 81–92
Aggarwal CC, Yu PS (2010) On clustering massive text and categorical data streams. Knowl Inf Syst 24(2):171–196
Article Google Scholar
Aghabozorgi SR, Saybani MR, Wah TY (2012) Incremental clustering of time-series by fuzzy clustering. J Inf Sci Eng (JISE) 28(4):671–688
MathSciNet Google Scholar
Alïtelhadj A, Boughanem M, Mezghiche M, Souam F (2012) Using structural similarity for clustering xml documents. Knowl Inf Syst (KAIS) 32(1):109–139
Article Google Scholar
Bagnall AJ, Janacek GJ (2004) Clustering time series from arma models with clipped data. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 49–58
Banerjee A, Ghosh J (2006) Scalable clustering algorithms with balancing constraints. Data Min Knowl Discov 13(3):365–395
Article MathSciNet Google Scholar
Beringer J, HLullermeier E (2006) Online clustering of parallel data streams. Data Knowl Eng (DKE) 58(2):180–204
Article Google Scholar
Borovkova S, Permana FJ (2004) Modelling electricity prices by the potential jumpdiffusion. In: Proceedings of the autumn school and international conference on stochastic finance (StochFin), pp 239–263
Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the sixth SIAM international conference on data mining (SDM), pp 326–337
Cartea A, Figueroa MG (2005) Pricing in electricity markets a mean reverting jump diffusion model with seasonality. Appl Math Finance 12(4):313–335
Article MATH Google Scholar
Costa G, Manco G, Ortale R (2010) Density-based clustering of data streams at multiple resolutions. Data Min Knowl Discov (DMKD) 20(1):152–187
Article MathSciNet Google Scholar
Dai BR, Lin CR, Chen MS (2007) Constrained data clustering by depth control and progressive constraint relaxation. Int J Very Large Data Bases 16(2):201–217
Article Google Scholar
Davidson I, Ester M, Ravi SS (2007) Efficient incremental constrained clustering. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), pp 240–249
Davidson I, Ravi SS (2005) Clustering with constraints: Feasibility issues and the k-means algorithm. In: Proceedings of the fifth SIAM international conference on data mining (SDM), pp 138–149
Ester M, Kriegel HP, Sander J, Wimmer M, Xu X (1998) Incremental clustering for mining in a data warehousing environment. In: Proceedings of the 24th international conference on very large data bases (VLDB), pp 323–333
Ge R, Ester M, Gao BJ, Hu Z, Bhattacharya B, Ben-Moshe B (2008) Joint cluster analysis of attribute data and relationship data: the connected k-center problem, algorithms and applications. ACM Trans Knowl Discov Data 2(2):7:1–7:35
Article Google Scholar
Ge R, Ester M, Jin W, Davidson I (2007) Constraint-driven clustering. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining(SIGKDD), pp 320–329
Guha S, Meyerson A, Mishra N, Motwani R, OCallaghan L (2003) Clustering data streams: theory and practice. IEEE Trans Knowl Data Eng (TKDE) 15(3):515–528
Article Google Scholar
Guo L, Ai C, Wang X, Cai Z, Li Y (2009) Real time clustering of sensory data in wireless sensor networks. In: Proceedings of the 28th IEEE international performance computing and communications conference (IPCCC), pp 33–40
Halkidi M, Spiliopoulou M, Pavlou A (2012) A semi-supervised incremental clustering algorithm for streaming data. In: Proceedings of the 16th Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD), pp 578–590
Han J, Kamber M (2000) Data mining: concepts and techniques. Morgan Kaufmann, New York
Google Scholar
Huang J, Zhang J (2010) Distributed dual cluster algorithm based on grid for sensor streams. Int J Digit Content Technol Appl (JDCTA) 4(9):225–233
Article Google Scholar
Kavitha V, Punithavalli M (2010) Clustering time series data stream—a literature survey. ACM Trans Knowl Discov Data (IJCSIS) 8(1):289–294
Google Scholar
Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: Proceedings of the 19th international conference on machine learning (ICML), pp 307–314
Liao ZX, Peng WC (2012) Clustering spatial data with a geographic constraint: exploring local search. Knowl Inf Syst 31(1):153–170
Article Google Scholar
Lin CR, Liu KH, Chen MS (2005) Dual clustering: integrating data clustering over optimization and constraint domains. IEEE Trans Knowl Data Eng 17(5):628–637
Article Google Scholar
Lin J, Vlachos M, Keogh EJ, Gunopulos D (2004) Iterative incremental clustering of time series. In: Proceedings of the ninth international conference on extending database technology (EDBT), pp 106V122
Lühr S, Lazarescu M (2009) Incremental clustering of dynamic data streams using connectivity based representative points. Data Knowl Eng (DKE) 68(1):1–27
Article Google Scholar
Mirkin B (2011) Clustering for data mining: a data recovery approach. Taylor and Francis, London
Google Scholar
O’Callaghan L, Meyerson A, Motwani R, Mishra N, Guha S (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of the 18th IEEE international conference on data engineering (ICDE), pp 685–694
Pensa RG, Ienco D, Meo R (2012) Hierarchical co-clustering: off-line and incremental approaches. Data Mining Knowl Discov (DMKD). doi:10.1007/s10618-012-0292-8
Robert CP, Casella G (2004) Monte carlo statistical methods. Springer, New York
Book MATH Google Scholar
Rodrigues PP, Gama J, Lopes L (2008) Clustering distributed sensor data streams. In: Proceedings of the European conference on machine learning and knowledge discovery in databases—part II (ECML PKDD), pp 282–297
Rodrigues PP, Gama J, Pedroso JP (2008) Hierarchical clustering of time series data streams. IEEE Trans Knowl Data Eng (TKDE) 20(5):615–627
Google Scholar
Shi YB, Yuan CA, Huang Y, Wen YG (2010) A method of spatial clustering based on the combination of the spatial coordinate and attributes. In: Proceedings of the sixth international conference on information systems security (ICISS), pp 526–529
Shi Y, Zhang L (2010) Coid: A clustervoutlier iterative detection approach to multi-dimensional data analysis. Knowl Inf Sys. doi:10.1007/s10115-010-0323-y
Tai CH, Dai BR, Chen MS (2007) Incremental clustering in geography and optimization spaces. In: Proceedings of the 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 272–283
Taiwan area national freeway bureau. http://www.freeway.gov.tw/
Tan PN, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, New York
Google Scholar
Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of the 17th international conference on machine learning (ICML), pp 1103–1110
Wan L, Ng WK, Dang XH, Yu PS, Zhang K (2009) Density-based clustering of data streams at multiple resolutions. ACM Trans Knowl Discov Data (TKDD) 3(3):1–28
Article Google Scholar
Wang JW, Cheng CH (2007) An efficient method for estimating null values in relational databases. Knowl Inf Syst 12(3):379–394
Article Google Scholar
Wei LY, Peng WC (2009) Clustering data streams in optimization and geography domains. In: Proceedings of the 13th Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 997–1005
Yang J (2003) Dynamic clustering of evolving streams with a single pass. In: Proceedings of the 19th IEEE international conference on data engineering (ICDE), pp 695–697
Zhou J, Guan J, Li P (2007c) Dcad: a dual clustering algorithm for distributed spatial databases. Geo-Spatial Inf Sci 10(2):137–144
Article Google Scholar
Zhou A, Cao F, Yan Y, Sha C, He X (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 133–142
Zhou A, Cao F, Yan Y, Sha C, He X (2007) Distributed data stream clustering: a fast em-based approach. In: Proceedings of the 23rd international conference on data engineering (ICDE), pp 736–745

Download references

Acknowledgments

We thank anonymous reviewers for their very useful comments and suggestions. Wen-Chih Peng was supported in part by the National Science Council, Project No.100-2218-E-009-016-MY3 and 100-2218-E-009-013-MY3, by TaiwanMoE ATU Program, by HTC and by Delta.

Author information

Authors and Affiliations

Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan
Ling-Yin Wei & Wen-Chih Peng
Institute of Information Science, Academia Sinica, Taipei, Taiwan
Ling-Yin Wei

Authors

Ling-Yin Wei
View author publications
You can also search for this author in PubMed Google Scholar
Wen-Chih Peng
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wen-Chih Peng.

Appendix A. Hierarchical-based clustering algorithm

A hierarchical-based clustering (HBC) algorithm has been proposed for solving dual clustering problems in spatial data streams [43]. In each time window, the HBC algorithm first constructs an SC-graph and places explicit edges with their dissimilarity values in a priority queue \(Q\), which is a data structure that returns the edge with the minimal dissimilarity value in the set of explicit edges in \(Q\). As with a bottom-up hierarchical clustering algorithm, each vertex is initially regarded as a single cluster. The explicit edge with the minimal dissimilarity value is then removed from the priority queue \(Q\). Let that edge be \(e_{e}(S_{i},S_{j})\). If \(S_{i}\) and \(S_{j}\) belong to the same cluster, there is no need to cluster them. However, if they belong to different clusters (e.g., \(S_{i}\in C_{i}, S_{j}\in C_{j}\), and \(C_{i}\ne C_{j}\)), both the connectivity requirement and the complete subgraph requirement (Section 3.1) should be verified. If these two requirements are met, clusters \(C_{i}\) and \(C_{j}\) are merged to form a new cluster. This procedure is performed iteratively until \(Q\) is empty.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Wei, LY., Peng, WC. An incremental algorithm for clustering spatial data streams: exploring temporal locality. Knowl Inf Syst 37, 453–483 (2013). https://doi.org/10.1007/s10115-013-0636-8

Download citation

Received: 08 September 2011
Revised: 17 February 2013
Accepted: 05 April 2013
Published: 26 April 2013
Issue Date: November 2013
DOI: https://doi.org/10.1007/s10115-013-0636-8

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An incremental algorithm for clustering spatial data streams: exploring temporal locality

Abstract

Access this article

Similar content being viewed by others

A Comprehensive Survey of Clustering Algorithms

K-DBSCAN: An improved DBSCAN algorithm for big data

Silhouette Index as Clustering Evaluation Tool

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Appendix A. Hierarchical-based clustering algorithm

Rights and permissions

About this article

Cite this article

Keywords

Navigation

An incremental algorithm for clustering spatial data streams: exploring temporal locality

Abstract

Access this article

Similar content being viewed by others

A Comprehensive Survey of Clustering Algorithms

K-DBSCAN: An improved DBSCAN algorithm for big data

Silhouette Index as Clustering Evaluation Tool

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Appendix A. Hierarchical-based clustering algorithm

Appendix A. Hierarchical-based clustering algorithm

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation