Abstract
This paper investigates a methodology based on mathematical programming for solving the minimum sum-of-squares clustering problem, also known as the “k-means problem”, in the presence of side constraints. We propose several exact and approximate mixed-integer linear and nonlinear formulations. The approximations are based on norm inequalities and random projections, with approximation guarantees that rest on an additive version of the Johnson–Lindenstrauss lemma. We perform computational testing (with fixed CPU time) on a range of randomly generated and real data instances of medium size but high dimensionality. We show that when side constraints make the k-means algorithm inapplicable, our proposed methodology, which is easy and fast to implement and deploy, can obtain good solutions in limited amounts of time.
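As a rough illustration of the random-projection idea mentioned in the abstract, the Python sketch below evaluates the sum-of-squares clustering cost of a fixed assignment before and after a Gaussian random projection. It is not one of the paper's formulations: the function names `mssc_cost` and `gaussian_projection`, the synthetic data, and the chosen dimensions (2000 reduced to 200) are all assumptions made for this example, and no side constraints are modelled.

```python
import numpy as np


def mssc_cost(points, labels, k):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    cost = 0.0
    for j in range(k):
        members = points[labels == j]
        if len(members) > 0:
            centroid = members.mean(axis=0)
            cost += float(((members - centroid) ** 2).sum())
    return cost


def gaussian_projection(points, target_dim, rng):
    """Project n x d data to n x target_dim with an i.i.d. Gaussian matrix.

    Entries have standard deviation 1/sqrt(target_dim), so squared norms
    are preserved in expectation (hypothetical helper for this sketch).
    """
    d = points.shape[1]
    T = rng.normal(0.0, 1.0 / np.sqrt(target_dim), size=(d, target_dim))
    return points @ T


# Synthetic high-dimensional data with two shifted groups (illustrative only).
rng = np.random.default_rng(0)
points = rng.normal(size=(60, 2000))
points[30:] += 1.0
labels = np.repeat([0, 1], 30)

projected = gaussian_projection(points, 200, rng)
original_cost = mssc_cost(points, labels, 2)
projected_cost = mssc_cost(projected, labels, 2)
ratio = projected_cost / original_cost
print(f"cost ratio after projection: {ratio:.3f}")
```

For a fixed assignment the two costs are typically close, which is the behaviour that makes solving the clustering formulation in the lower-dimensional space attractive. The quality of the approximation depends on the target dimension and on the data.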
Data availability
The datasets generated and analysed in this paper are available upon request from the corresponding author.
Additional information
The first author has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Grant Agreement No. 764759 “MINOA”. The second author was supported by KASBA, funded by Regione Autonoma della Sardegna.
Cite this article
Liberti, L., Manca, B. Side-constrained minimum sum-of-squares clustering: mathematical programming and random projections. J Glob Optim 83, 83–118 (2022). https://doi.org/10.1007/s10898-021-01047-6
DOI: https://doi.org/10.1007/s10898-021-01047-6