A sampling-based exact algorithm for the solution of the minimax diameter clustering problem
We consider the problem of clustering a set of points so as to minimize the maximum intra-cluster dissimilarity, which is strongly NP-hard. Existing exact algorithms for this problem can handle datasets of up to a few thousand observations, which falls far short of current needs. The most popular heuristic for this problem, the complete-linkage hierarchical algorithm, provides feasible solutions that are usually far from optimal. We introduce a sampling-based exact algorithm aimed at solving large datasets. The algorithm alternates between running an exact procedure on a small sample of points and running a heuristic procedure to prove the optimality of the current solution. Our computational experience shows that the algorithm can solve to optimality problems containing more than 500,000 observations within moderate time limits, two orders of magnitude larger than the limits of previous exact methods.
Keywords: Clustering, Diameter, Large-scale optimization
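To make the alternation described in the abstract concrete, the following toy sketch may help. It is not the authors' implementation: brute-force enumeration stands in for the exact procedure, a greedy assignment stands in for the heuristic, and all function names (`diameter`, `exact_on_sample`, `try_extend`, `sample_and_certify`) are hypothetical. The idea it illustrates is that the optimal diameter on a sample is a lower bound for the full dataset, so any full partition that attains this bound is provably optimal.

```python
from itertools import product


def diameter(idx, lab, dist):
    """Largest dissimilarity between two points of idx that share a label."""
    m = len(idx)
    return max(
        (dist[idx[a]][idx[b]]
         for a in range(m) for b in range(a + 1, m) if lab[a] == lab[b]),
        default=0,
    )


def exact_on_sample(sample, dist, k):
    """Toy exact solver: enumerate every k-labeling of the sample (exponential)."""
    best_d, best_lab = float("inf"), None
    for lab in product(range(k), repeat=len(sample)):
        d = diameter(sample, lab, dist)
        if d < best_d:
            best_d, best_lab = d, lab
    return best_d, best_lab


def try_extend(sample, lab, bound, dist, k):
    """Greedy heuristic: place each remaining point without exceeding bound.

    Returns (labels, None) on success, or (None, p) where p is the first
    point that fits in no cluster.
    """
    n = len(dist)
    clusters = [[] for _ in range(k)]
    labels = {}
    for s, c in zip(sample, lab):
        clusters[c].append(s)
        labels[s] = c
    for p in range(n):
        if p in labels:
            continue
        for c in range(k):
            if all(dist[p][q] <= bound for q in clusters[c]):
                clusters[c].append(p)
                labels[p] = c
                break
        else:
            return None, p
    return [labels[i] for i in range(n)], None


def sample_and_certify(dist, k):
    """Alternate between the exact step and the heuristic step until certified."""
    n = len(dist)
    sample = list(range(min(k + 1, n)))  # assumption: arbitrary initial sample
    while True:
        bound, lab = exact_on_sample(sample, dist, k)   # lower bound on optimum
        labels, bad = try_extend(sample, lab, bound, dist, k)
        if labels is not None:
            return bound, labels     # full partition attains the lower bound
        sample.append(bad)           # grow the sample with the failing point


# Toy data (assumed): six points on a line, absolute difference as dissimilarity.
points = [0, 1, 10, 11, 20, 21]
dist = [[abs(a - b) for b in points] for a in points]
bound, labels = sample_and_certify(dist, k=3)
print(bound, labels)
```

The loop always terminates, because each failed extension adds one point to the sample and the extension trivially succeeds once the sample is the whole dataset. The brute-force step makes this sketch usable only on very small inputs.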
This research was financed by the Fonds de recherche du Québec - Nature et technologies (FRQNT) under grant no. 181909 and by the Natural Sciences and Engineering Research Council of Canada (NSERC) under grants 435824-2013 and 2017-05617. This support is gratefully acknowledged.