Fast Hard Clustering Based on Soft Set Multinomial Distribution Function

Part of the Lecture Notes in Networks and Systems book series (LNNS, volume 457)

Abstract

Categorical data clustering remains challenging because measuring the similarity of categorical data is difficult and complex. Several approaches have been introduced, most recently centroid-based approaches that aim to reduce this complexity. However, those techniques still incur high computational times. In this paper, we propose a clustering technique for categorical data based on soft set theory and the multinomial distribution, called Hard Clustering using Soft Set based on Multinomial Distribution Function (HCSS). The data are represented as a multi soft set in which every soft set has its own probability of being a member of each cluster. First, a proof of correctness is given mathematically. Then, experiments on benchmark datasets evaluate processing time, purity, and Rand index. The experimental results show that the proposed approach improves processing time by up to 95.03% without compromising purity or Rand index compared with baseline techniques.
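The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' HCSS implementation: the function names (`multi_soft_set`, `hard_cluster`, `purity`, `rand_index`), the Laplace smoothing, the fixed initial labels, and the toy data are all assumptions made for illustration. Each (attribute, value) pair is treated as one soft set over the objects, per-cluster multinomial probabilities are estimated from those soft sets, and every object is then hard-assigned to its most likely cluster.

```python
import math
from collections import Counter
from itertools import combinations


def multi_soft_set(rows):
    """Map each (attribute index, value) pair to the set of object indices that hold it."""
    soft = {}
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            soft.setdefault((j, value), set()).add(i)
    return soft


def hard_cluster(rows, k, init_labels, max_iter=50):
    """Alternate multinomial parameter estimates with hard (0/1) reassignment."""
    soft = multi_soft_set(rows)
    n_values = Counter(j for j, _ in soft)  # number of distinct values per attribute
    labels = list(init_labels)
    for _ in range(max_iter):
        sizes = Counter(labels)
        # lam[c][(j, v)]: Laplace-smoothed probability of value v on attribute j in cluster c
        # (the smoothing is an assumption added here to avoid log(0))
        lam = [
            {
                (j, v): (sum(labels[i] == c for i in members) + 1) / (sizes[c] + n_values[j])
                for (j, v), members in soft.items()
            }
            for c in range(k)
        ]
        # Hard assignment: each object joins the cluster with the highest log-likelihood
        new_labels = [
            max(range(k), key=lambda c: sum(math.log(lam[c][(j, v)]) for j, v in enumerate(row)))
            for row in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def purity(pred, truth):
    """Fraction of objects that belong to the majority true class of their cluster."""
    by_cluster = {}
    for p, t in zip(pred, truth):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(pred)


def rand_index(pred, truth):
    """Fraction of object pairs on which the two partitions agree (together or apart)."""
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[a] == pred[b]) == (truth[a] == truth[b]) for a, b in pairs)
    return agree / len(pairs)


# Hypothetical toy data: two attributes (colour, size) and a known two-class ground truth
rows = [("red", "small"), ("red", "small"), ("red", "medium"),
        ("blue", "large"), ("blue", "large"), ("blue", "medium")]
truth = ["a", "a", "a", "b", "b", "b"]
labels = hard_cluster(rows, k=2, init_labels=[0, 1, 0, 1, 0, 1])
print(labels, purity(labels, truth), rand_index(labels, truth))
```

Because the assignment step only needs per-cluster counts over the soft sets, no pairwise similarity between objects is computed, which is one plausible reading of why a soft-set formulation could be fast. The sketch makes no claim about the reported 95.03% figure.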

Keywords

  • Clustering
  • Categorical data
  • Multi soft set
  • Multinomial distribution function

Abbreviations

  • \(S\): Information system (information table)
  • \(S_{\{0,1\}}\): System with values in \(\{0, 1\}\)
  • \(U\): Universe
  • \(|U|\): Cardinality of \(U\)
  • \(u\): Object of \(U\)
  • \(A\): Set of attributes (variables)
  • \(a\): Subset of attributes
  • \(E\): Parameter in a soft set
  • \(i\): Index \(i\)
  • \(j\): Index \(j\)
  • \(k\): Index \(k\)
  • \(l\): Index \(l\)
  • \(e\): Subset of parameters
  • \(V\): Domain value set
  • \(V_{a}\): Domain (value set) of variable \(a\)
  • \(f\): Information function
  • \(F\): Parameter mapping function
  • \(y\): Object
  • \(P(U)\): Power set of the universe
  • \((F,A)\): Soft set
  • \(F(a)\): Soft set of parameter \(a\)
  • \(C_{(F,E)}\): Class soft set
  • \(P\): Probability
  • \(p_{i}\): Probability for each trial \(i\)
  • \(f(x,a_{k})\): Probability mass function
  • \(n_{i}, N_{i}\): Number of trial \(i\)
  • \(\lambda\): Probability of the multinomial distribution
  • \(C_{k}\): Cluster \(k\)
  • \(K\): Number of clusters
  • \(z_{ik}\): Indicator function
  • \(CML(z,\lambda)\): Conditional maximum likelihood function
  • \(\text{Maximize } L_{CML}(z,\lambda)\): Maximizing the log-likelihood function
  • \(L_{CML}(z,\lambda,w_{1},w_{2})\): Lagrange function
  • \(w_{1}\): Lagrange multiplier for constraint 1
  • \(w_{2}\): Lagrange multiplier for constraint 2
  • HCSS: Hard Clustering using Soft Set based on Multinomial Distribution Function
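
For orientation, one common way these symbols fit together in a hard-clustering likelihood for categorical data is sketched below. This is an assumption inferred from the notation, not the chapter's own derivation: the indicator \(\mathbb{1}[\cdot]\), the value labels \(v_{jl}\), the triple index on \(\lambda_{kjl}\), and the per-constraint indexing of \(w_{1}\) and \(w_{2}\) are introduced here for illustration only.

```latex
% Log-likelihood with hard memberships z_{ik} in {0,1} (assumed form)
\[
L_{CML}(z,\lambda)=\sum_{i=1}^{|U|}\sum_{k=1}^{K} z_{ik}
  \sum_{j=1}^{|A|}\sum_{l=1}^{|V_{a_j}|}
  \mathbb{1}\!\left[f(u_i,a_j)=v_{jl}\right]\log\lambda_{kjl}
\]
% Constraints: each object is in exactly one cluster; each multinomial sums to one
\[
\sum_{k=1}^{K} z_{ik}=1,\qquad \sum_{l=1}^{|V_{a_j}|}\lambda_{kjl}=1
\]
% Lagrange function with one multiplier family per constraint
\[
L_{CML}(z,\lambda,w_1,w_2)=L_{CML}(z,\lambda)
  -\sum_{i} w_{1i}\Bigl(\sum_{k} z_{ik}-1\Bigr)
  -\sum_{k}\sum_{j} w_{2kj}\Bigl(\sum_{l}\lambda_{kjl}-1\Bigr)
\]
% Stationarity in lambda gives relative frequencies within each cluster
\[
\lambda_{kjl}=\frac{\sum_{i} z_{ik}\,\mathbb{1}\!\left[f(u_i,a_j)=v_{jl}\right]}{\sum_{i} z_{ik}},
\qquad
z_{ik}=1 \iff k=\arg\max_{k'}\sum_{j}\sum_{l}
  \mathbb{1}\!\left[f(u_i,a_j)=v_{jl}\right]\log\lambda_{k'jl}
\]
```

Under these assumptions, setting the derivative with respect to \(\lambda_{kjl}\) to zero gives \(\lambda_{kjl}=n_{kjl}/w_{2kj}\), where \(n_{kjl}\) counts the objects of cluster \(k\) taking value \(v_{jl}\). Summing over \(l\) then forces \(w_{2kj}=\sum_{i} z_{ik}\), because each object takes exactly one value per attribute.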

Author information

Corresponding author

Correspondence to Iwan Tri Riyadi Yanto.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yanto, I.T.R., Setiyowati, R., Deris, M.M., Senan, N. (2022). Fast Hard Clustering Based on Soft Set Multinomial Distribution Function. In: Ghazali, R., Mohd Nawi, N., Deris, M.M., Abawajy, J.H., Arbaiy, N. (eds) Recent Advances in Soft Computing and Data Mining. SCDM 2022. Lecture Notes in Networks and Systems, vol 457. Springer, Cham. https://doi.org/10.1007/978-3-031-00828-3_1
