A fuzzy data reduction cluster method based on boundary information for large datasets
The fuzzy c-means (FCM) algorithm computes the membership degree of each data point in each cluster. This computation requires the distance matrix between the cluster centers and the data points. The main bottleneck of FCM is computing the membership matrix for all data points. This work presents a new clustering method, bdrFCM (boundary data reduction fuzzy c-means). Our algorithm is based on the original FCM proposal, adapted to detect and remove the boundary regions of clusters. Our implementation efforts have two goals: processing large datasets in less time, and reducing the data volume while maintaining the quality of the clusters. We evaluated the method on a large volume of real application data (> 10^6 records) and found that the bdrFCM implementation scales well to datasets with millions of data points.
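To make the bottleneck concrete, the following is a minimal sketch of the standard FCM membership update, not the bdrFCM implementation. It assumes Euclidean distance and the usual fuzzifier m = 2; the function name `fcm_memberships` and the sample points are illustrative choices, not taken from the paper.

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """Return U, where U[i][j] is the membership of point i in cluster j.

    Standard FCM update: u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)).
    Cost is one distance per (point, center) pair, which dominates
    the run time on large datasets.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        # One row of the distance matrix: distances from p to every center.
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # p coincides with one or more centers: share membership among them.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        row = [1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
               for d_j in dists]
        memberships.append(row)
    return memberships

# Illustrative data (assumed): two centers on a line and three points.
centers = [(0.0, 0.0), (2.0, 0.0)]
points = [(0.5, 0.0), (1.0, 0.0), (2.0, 0.0)]
U = fcm_memberships(points, centers)
```

In this example the point midway between the two centers receives equal membership in both clusters, which is the kind of ambiguous point that lies in a boundary region between clusters.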
Keywords: Fuzzy c-means, Large dataset, Boundary information
This work has been supported by the Brazilian agencies CAPES, CNPq, and FAPEMIG.
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.