A fuzzy data reduction cluster method based on boundary information for large datasets

Silva, Gustavo R. L.; Neto, Paulo C.; Torres, Luiz C. B.; Braga, Antônio P.

doi:10.1007/s00521-019-04049-4

WSOM 2017
Published: 04 February 2019

Volume 32, pages 18059–18068, (2020)

Neural Computing and Applications

Gustavo R. L. Silva¹ (ORCID: 0000-0002-1436-8485), Paulo C. Neto¹, Luiz C. B. Torres¹ & Antônio P. Braga¹

308 Accesses
5 Citations

Abstract

The fuzzy c-means (FCM) algorithm computes the membership degree of each data point with respect to the cluster centers. This computation requires the matrix of distances between the cluster centers and the data points. The main bottleneck of FCM is computing the membership matrix for all data points. This work presents a new clustering method, bdrFCM (boundary data reduction fuzzy c-means). The algorithm is based on the original FCM proposal, adapted to detect and remove the boundary regions of clusters. Our implementation efforts address two aspects: processing large datasets in less time, and reducing the data volume while maintaining the quality of the clusters. The method was applied to large real datasets (> 10⁶ records), and we found that the bdrFCM implementation scales well to datasets with millions of data points.
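To make the bottleneck described above concrete, the following is a minimal sketch of the textbook FCM iteration in Python with NumPy. It is not the authors' bdrFCM implementation, and it does not include any boundary detection or data reduction. The function names (`fcm_memberships`, `fcm_centers`, `fcm`), the fuzzifier default `m=2.0`, the random initialization and the stopping rule are all assumptions chosen for illustration. The point-to-center distance matrix has shape (n, c) and is recomputed in every iteration, which is the cost that grows with the number of data points.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0, eps=1e-12):
    """Textbook FCM membership update (illustrative, not bdrFCM).

    X: (n, d) data points; centers: (c, d) cluster centers; m > 1 is the fuzzifier.
    Returns U of shape (n, c); each row sums to 1.
    """
    # Distance matrix between every data point and every center: shape (n, c).
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    dist = np.maximum(dist, eps)  # avoid division by zero when a point sits on a center
    # u_ik = d_ik^(-2/(m-1)) / sum_j d_ij^(-2/(m-1))
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fcm_centers(X, U, m=2.0):
    """Center update: mean of the data weighted by u_ik^m."""
    W = U ** m                                   # (n, c)
    return (W.T @ X) / W.sum(axis=0)[:, None]    # (c, d)

def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Alternate membership and center updates until the centers stop moving."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize centers from c distinct randomly chosen data points.
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(n_iter):
        U = fcm_memberships(X, centers, m)
        new_centers = fcm_centers(X, U, m)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, fcm_memberships(X, centers, m)
```

Each iteration of this sketch touches all n × c distances, so any scheme that shrinks n (as the abstract says bdrFCM does by removing boundary regions) reduces the per-iteration cost proportionally.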

Acknowledgements

This work has been supported by the Brazilian agencies CAPES, CNPq and FAPEMIG.

Author information

Authors and Affiliations

Graduate Program in Electrical Engineering, Federal University of Minas Gerais, Av. Antônio Carlos 6627, Belo Horizonte, MG, 31270-901, Brazil
Gustavo R. L. Silva, Paulo C. Neto, Luiz C. B. Torres & Antônio P. Braga

Corresponding author

Correspondence to Gustavo R. L. Silva.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Luiz C. B. Torres is a fellow of CNPq, Brazil (No. 150254/2016-4).

About this article

Cite this article

Silva, G.R.L., Neto, P.C., Torres, L.C.B. et al. A fuzzy data reduction cluster method based on boundary information for large datasets. Neural Comput & Applic 32, 18059–18068 (2020). https://doi.org/10.1007/s00521-019-04049-4

Received: 21 February 2018
Accepted: 23 January 2019
Published: 04 February 2019
Issue Date: December 2020
DOI: https://doi.org/10.1007/s00521-019-04049-4
