A Semi Brute-Force Search Approach for (Balanced) Clustering

Xu, Yicheng; Chau, Vincent; Wu, Chenchen; Zhang, Yong; Zissimopoulos, Vassilis; Zou, Yifei

doi:10.1007/s00453-023-01158-4

A Semi Brute-Force Search Approach for (Balanced) Clustering

Published: 01 August 2023

Volume 86, pages 130–146, (2024)
Cite this article

Algorithmica Aims and scope Submit manuscript

Yicheng Xu¹,
Vincent Chau²,
Chenchen Wu³,
Yong Zhang¹,
Vassilis Zissimopoulos⁴ &
…
Yifei Zou⁵

240 Accesses
Explore all metrics

Abstract

Clustering is one of the most long-standing fundamental problems in the fields of computational geometry and algorithm design. In this paper, we focus on the variance-based clustering problems, included in which is the widely known k-means clustering. As the main contribution, a so-called semi brute-force search approach is proposed and analyzed from both theoretical and experimental aspects. The proposed approach samples a small percentage from the input dataset and search in a brute-force way for a k sized seed whose resulting Voronoi Diagram gives a good clustering of the original dataset. With high probability, the clustering is provably good to estimate the optimum under certain assumptions. Extensive experiments on both synthetic datasets and real-world datasets show that to obtain competitive results compared with k-means method (Llyod in IEEE Trans Inf Theory 28(2):129–137, 1982) and k-means++ method (Arthur and Vassilvitskii, (in: 18th ACM-SIAM symposium on discrete algorithms (SODA), 2007)), we only need a subset of 7% size comparing with the input dataset. If we are allowed to sample 15% from the dataset, our algorithm outperforms both the k-means method and k-means++ method in at least 80% of the clustering tasks. Also, an extended algorithm based on the same idea guarantees a balanced k-clustering result.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Comprehensive Survey of Clustering Algorithms

Article 01 June 2015

Data clustering: application and trends

Article 27 November 2022

An Improved K-means Clustering Algorithm Towards an Efficient Data-Driven Modeling

Article 25 June 2022

Notes

Note that \(V_1\), \(V_2\) are not necessary to be independent for Fact 1 to hold, but it does not hurt to say so.

References

Lloyd, S.: Least squares quantization in pcm. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
Article MathSciNet Google Scholar
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Philip, S.Y., et al.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)
Article Google Scholar
Arthur, D., Vassilvitskii, S.: \(k\)-means++: The advantages of careful seeding. In: Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1027–1035 (2007)
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable \(k\)-means++. In: Proceedings of the 38th Very Large Data Bases (VLDB), pp. 622–633 (2012)
Bachem, O., Lucic, M., Hassani, S.H., Krause, A.: Approximate \(k\)-means++ in sublinear time. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), pp. 1459–1467 (2016)
Lattanzi, S., Sohler, C.: A better \(k\)-means++ algorithm via local search. In: Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3662–3671 (2019)
Choo, D., Grunau, C., Portmann, J., Rozhoň, V.: \(k\)-means++: few more steps yield constant approximation. In: Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 1909–1917 (2020)
Awasthi, P., Charikar, M., Krishnaswamy, R., Sinop, A.K.: The hardness of approximation of euclidean \(k\)-means. In: Proceedings of the 31st ACM Symposium on International Symposium on Computational Geometry (SoCG), pp. 754–767 (2015)
Liu, H., Han, J., Nie, F., Li, X.: Balanced clustering with least square regression. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), pp. 2231–2237 (2017)
Li, Z., Nie, F., Chang, X., Ma, Z., Yang, Y.: Balanced clustering via exclusive lasso: A pragmatic approach. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), pp. 3596–3603 (2018)
Lin, W., He, Z., Xiao, M.: Balanced clustering: A uniform model and fast algorithm. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pp. 2987–2993 (2019)
Xu, Y., Möhring, R.H., Xu, D., Zhang, Y., Zou, Y.: A constant fpt approximation algorithm for hard-capacitated \(k\)-means. Optim. Eng. 21(3), 709–722 (2020)
Article MathSciNet Google Scholar
Cohen-Addad, V., Li, J.: On the fixed-parameter tractability of capacitated clustering. In: Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP), pp. 1–14 (2019)
Bandyapadhyay, S., Lochet, W., Saurabh, S.: FPT constant-approximations for capacitated clustering to minimize the sum of cluster radii. To appear in the proceedings of the 39th International Symposium on Computational Geometry (SoCG), (2023)
Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the \(k\)-means problem. J. ACM 59(6), 1–22 (2013)
Article MathSciNet Google Scholar
Braverman, V., Cohen-Addad, V., Jiang, H.C.S., Krauthgamer, R., Schwiegelshohn, C., Toftrup, M.B., Wu, X.: The power of uniform sampling for coresets. In: Proceedings of the 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 462–473 (2022)
Jain, K., Vazirani, V.V.: Approximation algorithms for metric facility location and \(k\)-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM 48(2), 274–296 (2001)
Article MathSciNet Google Scholar
Inaba, M., Katoh, N., Imai, H.: Applications of weighted voronoi diagrams and randomization to variance-based \(k\)-clustering. In: Proceedings of the 10th ACM Symposium International Symposium on Computational Geometry (SoCG), pp. 332–339 (1994)
Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows - Theory, Algorithms and Applications. Prentice Hall, USA (1993)
Google Scholar

Download references

Funding

Yicheng Xu is supported by Fundamental Research Project of Shenzhen City (No. JCYJ20210324102012033) and Shenzhen Science and Technology Program (No. ZDSYS20220422103800001 and CJGJZD20210408092806017). Vincent Chau is supported by NSFC (No. 62202100), and by the Fundamental Research Funds for the Central Universities (No. 2242022R10024). Chenchen Wu is supported by National Natural Science Foundation of China (No.11971349). Yong Zhang is supported by National Key R &D Program of China (No. 2022YFE0196100) and National Natural Science Foundation of China (No. 12071460). Vassilis Zissimopoulos is supported by the CAS President’s International Fellowship Initiative (No. 2019VTA0005). Yifei Zou is supported by Shandong Science Fund for Excellent Young Scholars (No. 2023HWYQ-007).

Author information

Authors and Affiliations

Shenzhen Insititute of Advanced Technology, Chinese Academy of Sciences, Beijing, China
Yicheng Xu & Yong Zhang
Southeast University, Dhaka, China
Vincent Chau
Tianjin University of Technology, Tianjin, China
Chenchen Wu
National and Kapodistrian University of Athens, Zografou, Greece
Vassilis Zissimopoulos
Shandong University, Jinan, China
Yifei Zou

Authors

Yicheng Xu
View author publications
You can also search for this author in PubMed Google Scholar
Vincent Chau
View author publications
You can also search for this author in PubMed Google Scholar
Chenchen Wu
View author publications
You can also search for this author in PubMed Google Scholar
Yong Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Vassilis Zissimopoulos
View author publications
You can also search for this author in PubMed Google Scholar
Yifei Zou
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

YX and VC wrote the main manuscript text. All other authors participated on the problem discussion and helped to review the manuscript.

Corresponding author

Correspondence to Vincent Chau.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Xu, Y., Chau, V., Wu, C. et al. A Semi Brute-Force Search Approach for (Balanced) Clustering. Algorithmica 86, 130–146 (2024). https://doi.org/10.1007/s00453-023-01158-4

Download citation

Received: 28 March 2023
Accepted: 22 July 2023
Published: 01 August 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s00453-023-01158-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Semi Brute-Force Search Approach for (Balanced) Clustering

Abstract

Access this article

Similar content being viewed by others

A Comprehensive Survey of Clustering Algorithms

Data clustering: application and trends

An Improved K-means Clustering Algorithm Towards an Efficient Data-Driven Modeling

Notes

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A Semi Brute-Force Search Approach for (Balanced) Clustering

Abstract

Access this article

Similar content being viewed by others

A Comprehensive Survey of Clustering Algorithms

Data clustering: application and trends

An Improved K-means Clustering Algorithm Towards an Efficient Data-Driven Modeling

Notes

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation