Clustering for Binary-Featured Datasets
Clustering is one of the central techniques of unsupervised machine learning. Although numerous clustering algorithms exist, many of them, including the popular k-means algorithm, require the number of clusters to be specified in advance, which is a significant drawback. Some studies use the silhouette coefficient to determine the optimal number of clusters. In this study, we introduce a novel algorithm called Powered Outer Probabilistic Clustering, show how it works through back-propagation (starting with many clusters and ending with an optimal number of clusters), and show that the algorithm converges to the expected (optimal) number of clusters on theoretical examples.
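Since the abstract describes the algorithm only at a high level, a minimal sketch may help. The code below assumes one plausible reading of the idea: initialize with deliberately many clusters, estimate per-cluster feature probabilities, reassign each point to the cluster whose powered probabilities best match its binary features, and let empty clusters disappear until assignments stabilize. All names (popc_sketch, k_init, power) and the exact scoring rule are assumptions for illustration, not the chapter's definitive method.

```python
import numpy as np

def popc_sketch(X, k_init=20, power=2.0, max_iter=100, seed=0):
    """X: (n_samples, n_features) binary 0/1 matrix. Returns integer labels."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    labels = rng.integers(0, k_init, size=n)  # deliberately too many clusters

    for _ in range(max_iter):
        clusters = np.unique(labels)  # empty clusters drop out here
        # Laplace-smoothed per-cluster probability that each feature equals 1.
        probs = np.array([(X[labels == c].sum(axis=0) + 1.0) /
                          ((labels == c).sum() + 2.0) for c in clusters])
        # Probability of each observed feature value under each cluster,
        # raised to `power` and summed over features (one plausible reading
        # of "powered probabilities"; the chapter's exact rule may differ).
        match = (X[:, None, :] * probs[None, :, :]
                 + (1 - X[:, None, :]) * (1 - probs[None, :, :]))
        scores = (match ** power).sum(axis=2)  # shape (n, n_clusters)
        new_labels = clusters[np.argmax(scores, axis=1)]
        if np.array_equal(new_labels, labels):
            break  # assignments stable: converged
        labels = new_labels

    return labels
```

Starting with k_init well above the true number of groups, clusters that lose all of their points simply vanish, so on well-separated binary data the loop tends to settle on far fewer clusters than it started with, mirroring the "many clusters in, optimal number out" behaviour the abstract describes.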
Keywords: Binary-valued features · Clustering · Emails · k-Means · Optimal number of clusters · Probabilities
The authors would like to thank David James Brunner for many fruitful discussions on knowledge workers' information overload, as well as for proofreading the first draft. The authors would also like to thank the anonymous reviewers for providing feedback that led to significant improvements to this chapter.