A fuzzy data reduction cluster method based on boundary information for large datasets

  • WSOM 2017
Neural Computing and Applications

Abstract

The fuzzy c-means (FCM) algorithm computes the membership degree of each data point in each cluster. This computation requires the distance matrix between the cluster centers and the data points. The main bottleneck of FCM is computing the membership matrix for all data points. This work presents a new clustering method, bdrFCM (boundary data reduction fuzzy c-means). Our algorithm is based on the original FCM proposal, adapted to detect and remove the boundary regions of clusters. Our implementation efforts focus on two aspects: processing large datasets in less time, and reducing the data volume while maintaining the quality of the clusters. We evaluated the method on large real-world datasets (> 10^6 records) and found that the bdrFCM implementation scales well to datasets with millions of data points.
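
To make the cost described in the abstract concrete, the following is a minimal NumPy sketch of the standard FCM membership and center updates. It is not the authors' bdrFCM implementation. The function names, the fuzzifier default `m=2.0`, and in particular the `drop_ambiguous_points` filter with its `margin` threshold are assumptions introduced here for illustration; the abstract does not state how bdrFCM detects boundary regions.

```python
import numpy as np


def fcm_memberships(X, centers, m=2.0, eps=1e-12):
    """Standard FCM membership update.

    X: (n, d) data points; centers: (c, d) cluster centers; m > 1 is the fuzzifier.
    Returns U of shape (n, c); each row sums to 1.
    """
    # Distance matrix between every data point and every cluster center: (n, c).
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    dist = np.maximum(dist, eps)  # guard against a point lying exactly on a center
    # u_ik = d_ik^(-2/(m-1)) / sum_j d_ij^(-2/(m-1))
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def fcm_centers(X, U, m=2.0):
    """Center update: mean of the data weighted by memberships raised to m."""
    W = U ** m  # (n, c)
    return (W.T @ X) / W.sum(axis=0)[:, None]


def drop_ambiguous_points(X, U, margin=0.2):
    """Hypothetical boundary filter (an assumption, NOT the paper's criterion).

    Drops points whose two largest memberships differ by less than `margin`,
    i.e. points that sit between clusters. Requires at least two clusters.
    """
    top2 = np.sort(U, axis=1)[:, -2:]  # two largest memberships per point
    keep = (top2[:, 1] - top2[:, 0]) >= margin
    return X[keep], keep


# Toy data: five points on a line, two centers at the ends.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.9, 0.0], [2.0, 0.0]])
centers = np.array([[0.0, 0.0], [2.0, 0.0]])

U = fcm_memberships(X, centers)
new_centers = fcm_centers(X, U)
X_reduced, keep = drop_ambiguous_points(X, U)
```

In this toy example the middle point is equidistant from both centers, so its memberships are equal and the hypothetical filter drops it. The distance and membership matrices both have n × c entries and are recomputed at every iteration, which is why shrinking n pays off for datasets with millions of points.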




Notes

  1. https://cran.r-project.org/web/packages/mlbench/mlbench.pdf.

  2. https://archive.ics.uci.edu/ml/datasets/Poker+Hand.

  3. http://archive.ics.uci.edu/ml/datasets/pamap2+physical+activity+monitoring.

  4. http://yann.lecun.com/exdb/mnist/.

  5. https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/.

  6. https://archive.ics.uci.edu/ml/datasets/skin+segmentation.


Acknowledgements

This work was supported by the Brazilian agencies CAPES, CNPq, and FAPEMIG.

Author information

Corresponding author

Correspondence to Gustavo R. L. Silva.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Luiz C. B. Torres is a fellow of CNPq, Brazil (No. 150254/2016-4).


About this article


Cite this article

Silva, G.R.L., Neto, P.C., Torres, L.C.B. et al. A fuzzy data reduction cluster method based on boundary information for large datasets. Neural Comput & Applic 32, 18059–18068 (2020). https://doi.org/10.1007/s00521-019-04049-4


  • DOI: https://doi.org/10.1007/s00521-019-04049-4
