Abstract
Clustering categorical data remains challenging because measuring the similarity between categorical objects is difficult. Several approaches have been proposed, most recently centroid-based approaches that reduce the complexity of the similarity computation. However, these techniques still incur high computational times. In this paper, we propose a clustering technique for categorical data based on soft set theory and the multinomial distribution, called Hard Clustering using Soft Set based on Multinomial Distribution Function (HCSS). The data are represented as a multi soft set in which every soft set has its own probability of being a member of each cluster. First, a proof of correctness is given mathematically. Then, experiments on benchmark datasets evaluate processing time, purity, and Rand index. The experimental results show that the proposed approach improves processing time by up to 95.03% without compromising purity or Rand index when compared with baseline techniques.
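The two quality measures named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions of purity and the Rand index, not the authors' evaluation code. The function names and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def purity(labels_true, labels_pred):
    """Fraction of objects that belong to the majority class of their cluster."""
    clusters = {}
    for truth, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, []).append(truth)
    # For each cluster, count the objects carrying its most frequent true label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(labels_true)


def rand_index(labels_true, labels_pred):
    """Fraction of object pairs on which the two labelings agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy example (hypothetical data): six objects, two true classes, two clusters.
truth = ["a", "a", "a", "b", "b", "b"]
pred = [0, 0, 1, 1, 1, 1]
print(purity(truth, pred))      # 5 of 6 objects sit in their cluster's majority class
print(rand_index(truth, pred))  # 10 of 15 pairs are treated consistently
```

Both measures lie in [0, 1], with 1 indicating a clustering that matches the reference labels exactly.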
Abbreviations
- \(S\): Information system/information table
- \(S_{\{0,1\}}\): System with values {0, 1}
- \(U\): Universe
- \(|U|\): Cardinality of \(U\)
- \(u\): Object of \(U\)
- \(A\): Set of attributes/variables
- \(a\): Subset of attributes
- \(E\): Parameter in soft set
- \(i\): Index \(i\)
- \(j\): Index \(j\)
- \(k\): Index \(k\)
- \(l\): Index \(l\)
- \(e\): Subset of parameters
- \(V\): Domain value set
- \({V}_{a}\): Domain (value set) of variable \(a\)
- \(f\): Information function
- \(F\): Parameter mapping function
- \(y\): Object
- \(P(U)\): Power set of the universe \(U\)
- \((F,A)\): Soft set
- \(F\left(a\right)\): Soft set of parameter \(a\)
- \({C}_{\left(F,E\right)}\): Class soft set
- \(P\): Probability
- \({p}_{i}\): Probability for each trial \(i\)
- \(f\left(x,{a}_{k}\right)\): Probability mass function
- \({n}_{i}, {N}_{i}\): Number of trials \(i\)
- \(\lambda \): Probability of multinomial distribution
- \({C}_{k}\): Cluster \(k\)
- \(K\): Number of clusters
- \({z}_{ik}\): Indicator function
- \(CML\left(z,\lambda \right)\): Conditional maximum likelihood function
- \(Maximize{L}_{CML}\left(z,\lambda \right)\): Maximization of the log-likelihood function
- \({L}_{CML}\left(z,\lambda ,{w}_{1},{w}_{2}\right)\): Lagrange function
- \({w}_{1}\): Lagrange multiplier for constraint 1
- \({w}_{2}\): Lagrange multiplier for constraint 2
- HCSS: Hard Clustering using Soft Set based on Multinomial Distribution Function
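To make the notation above concrete, the following minimal sketch builds a multi soft set from a small categorical table. It assumes the common construction in which each attribute \(a\) yields one soft set mapping every value in \({V}_{a}\) to the set of objects of \(U\) that carry that value, and it derives the corresponding \(\{0,1\}\)-valued system from that. The function names, the table, and its values are hypothetical, and this is not the authors' implementation of HCSS.

```python
def multi_soft_set(table, attributes):
    """Build one soft set per attribute.

    table: dict mapping each object u in U to a dict {attribute: value}.
    Returns {attribute a: {value v in V_a: set of objects u with f(u, a) = v}}.
    """
    result = {}
    for a in attributes:
        soft = {}
        for u, row in table.items():
            soft.setdefault(row[a], set()).add(u)
        result[a] = soft
    return result


def to_binary_system(table, mss):
    """Express the multi soft set as a {0, 1}-valued system:
    one indicator per (attribute, value) pair for every object."""
    return {
        u: {
            (a, v): int(u in members)
            for a, soft in mss.items()
            for v, members in soft.items()
        }
        for u in table
    }


# Hypothetical information table with four objects and two attributes.
table = {
    "u1": {"color": "red", "size": "small"},
    "u2": {"color": "blue", "size": "small"},
    "u3": {"color": "red", "size": "large"},
    "u4": {"color": "red", "size": "small"},
}
mss = multi_soft_set(table, ["color", "size"])
binary = to_binary_system(table, mss)
print(mss["color"]["red"])  # the objects u1, u3 and u4
print(binary["u2"])         # indicators for every (attribute, value) pair
```

In this representation every object has exactly one indicator equal to 1 per attribute, which is the kind of count data a multinomial model operates on.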
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yanto, I.T.R., Setiyowati, R., Deris, M.M., Senan, N. (2022). Fast Hard Clustering Based on Soft Set Multinomial Distribution Function. In: Ghazali, R., Mohd Nawi, N., Deris, M.M., Abawajy, J.H., Arbaiy, N. (eds) Recent Advances in Soft Computing and Data Mining. SCDM 2022. Lecture Notes in Networks and Systems, vol 457. Springer, Cham. https://doi.org/10.1007/978-3-031-00828-3_1
DOI: https://doi.org/10.1007/978-3-031-00828-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-00827-6
Online ISBN: 978-3-031-00828-3
eBook Packages: Intelligent Technologies and Robotics (R0)