Abstract
Grouping data points, or instances, into clusters is a fundamental task in data science. Clustering methods generally belong to unsupervised learning because the data sets they are applied to are unlabeled; that is, no information is available about the true cluster to which a data point belongs. Instead, the clustering is inferred from distance- and similarity-based relations among the data points. The data points can correspond to a wide variety of objects, for example, texts, vectors, or networks, and the resulting groups are called clusters. Many different approaches can be used to define clustering methods, and analyzing the validity of the resulting clusters can be intricate. In this chapter, we focus on clustering methods based on similarity and distance measures.
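As a rough illustration of how a clustering can be inferred from distances alone, the following minimal sketch groups unlabeled two-dimensional points whenever they lie within a chosen distance of one another. It is not taken from the chapter: the function names `euclidean` and `threshold_clusters`, the sample points, and the threshold `max_dist` are all assumptions made for this example, and Python is used here only for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def threshold_clusters(points, max_dist):
    """Assign a cluster label to every point (hypothetical helper).

    Two points end up in the same cluster if they are linked by a chain of
    points whose consecutive distances are all at most max_dist. This is the
    same partition obtained by cutting a single-linkage hierarchy at height
    max_dist.
    """
    n = len(points)
    labels = [None] * n
    next_label = 0
    for start in range(n):
        if labels[start] is not None:
            continue  # already reached from an earlier point
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and euclidean(points[i], points[j]) <= max_dist:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

# Illustrative, made-up data: two visually separated groups of points.
points = [(0.0, 0.0), (0.5, 0.2), (0.3, 0.6), (5.0, 5.0), (5.4, 4.8)]
labels = threshold_clusters(points, max_dist=1.0)
print(labels)
```

No labels are supplied to the procedure; the grouping follows entirely from the pairwise distances and the assumed threshold. A different distance measure or threshold would in general yield a different clustering.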
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Emmert-Streib, F., Moutari, S., Dehmer, M. (2023). Clustering. In: Elements of Data Science, Machine Learning, and Artificial Intelligence Using R. Springer, Cham. https://doi.org/10.1007/978-3-031-13339-8_7
DOI: https://doi.org/10.1007/978-3-031-13339-8_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13338-1
Online ISBN: 978-3-031-13339-8
eBook Packages: Intelligent Technologies and Robotics (R0)