Abstract
In the era of the Internet of Things (IoT), the proliferation of interconnected devices and sensors has produced an unprecedented volume of data. Effective data analysis, particularly clustering, has become pivotal in handling this volume. Clustering evaluation plays a critical role in determining the quality of clustering results. However, traditional cluster validity metrics are ill-suited to the distributed nature of IoT data. To address this gap, we introduce a novel distributed clustering evaluation metric named C4Y. It is rooted in sampling theory and is designed to evaluate the performance of clustering algorithms in distributed IoT environments. It rests on two key principles: (1) The dataset held by each distributed IoT node is treated as a sample of the entire dataset, and each sample is expected to exhibit a data distribution, including its category distribution, similar to that of the overall dataset. (2) The centers of each category across all samples are assumed to conform to a Gaussian distribution. The metric quantifies the extent to which category centers in different samples adhere to Gaussian distributions and measures the dissimilarity between these categories. Empirical results on various public datasets of diverse sizes and dimensions demonstrate that C4Y effectively assesses the performance of distributed clustering algorithms. This approach promises to advance data analytics for distributed IoT data and to support the development of sophisticated IoT systems.
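The two principles can be illustrated with a small sketch. This is not the C4Y formula, which the abstract does not give; it is a toy proxy under stated assumptions. Each node's data is treated as a sample, per-node category centers are computed, and the score compares how far apart the categories are with how much the per-node centers of one category scatter. The scatter term is a crude stand-in for the Gaussian-fit term described above. The names `node_centers`, `toy_consistency_score`, and `make_node` are hypothetical and exist only for this illustration.

```python
import math
import random
from statistics import fmean


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(axis) for axis in zip(*points)]


def node_centers(points, labels):
    """Return {label: center} for the data held by one node."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {label: centroid(members) for label, members in groups.items()}


def toy_consistency_score(nodes):
    """Toy proxy, NOT the published C4Y definition.

    nodes: list of (points, labels) pairs, one pair per node.
    Returns (smallest distance between category grand centers) divided by
    (largest mean distance of per-node centers from their grand center).
    Larger values mean the nodes agree on where each category sits and
    the categories are well separated.
    """
    per_node = [node_centers(points, labels) for points, labels in nodes]
    # Only categories that appear on every node can be compared.
    shared = sorted(set.intersection(*(set(c) for c in per_node)))
    if len(shared) < 2:
        raise ValueError("need at least two categories present on every node")
    grand = {lab: centroid([c[lab] for c in per_node]) for lab in shared}
    spread = max(
        fmean(math.dist(c[lab], grand[lab]) for c in per_node) for lab in shared
    )
    separation = min(
        math.dist(grand[a], grand[b])
        for i, a in enumerate(shared)
        for b in shared[i + 1:]
    )
    return separation / spread if spread > 0 else math.inf


def make_node(rng, n=50):
    """Synthetic node: two Gaussian blobs with their true labels."""
    points, labels = [], []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (10.0, 10.0)]):
        for _ in range(n):
            points.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
            labels.append(label)
    return points, labels


rng = random.Random(0)
nodes = [make_node(rng) for _ in range(4)]

# A poor clustering: same points, labels shuffled within each node.
shuffled_nodes = []
for points, labels in nodes:
    bad_labels = labels[:]
    rng.shuffle(bad_labels)
    shuffled_nodes.append((points, bad_labels))

good_score = toy_consistency_score(nodes)
bad_score = toy_consistency_score(shuffled_nodes)
print(good_score > bad_score)
```

On this synthetic data the correct labeling should score well above the shuffled one, because shuffling pulls every category center toward the middle of the data and makes the per-node centers disagree. A faithful implementation would replace the scatter term with the Gaussian-conformity measure defined in the paper.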
Data availability
The source code and dataset are available at https://github.com/XFastDataLab/C4Y.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Nos. 61673186, 61972010); the Natural Science Foundation of Fujian Province, China (Nos. 2021J01317, 2020J05059); the Scientific Research Funds of Huaqiao University (No. 19BS307); the Open Project of China Food Flavor and Nutrition Health Innovation Center (No. CFC2023B-029).
About this article
Cite this article
Chen, Y., Yang, Y. & Chen, Y. C4y: a metric for distributed IoT clustering. CCF Trans. Pervasive Comp. Interact. (2024). https://doi.org/10.1007/s42486-024-00148-x
DOI: https://doi.org/10.1007/s42486-024-00148-x