Abstract
Clustering comprises a group of unsupervised statistical techniques commonly used in many disciplines. When these techniques are applied to fish abundance data, many technical details must be considered to ensure reasonable interpretation. However, the reliability and stability of clustering methods have rarely been studied in the context of fisheries. This study presents an intensive evaluation of three common clustering methods, hierarchical clustering (HC), K-means (KM), and expectation-maximization (EM), based on fish community surveys in the coastal waters of Shandong, China. We evaluated the performance of these three methods under different numbers of clusters, data sizes, and data transformation approaches, focusing on consistency validation using the average proportion of non-overlap (APN) index. The results indicate that the three methods tend to disagree on the optimal number of clusters. EM performed relatively better at avoiding unbalanced classification, whereas HC and KM provided more stable clustering results. Data transformation, including scaling, square-root transformation, and log-transformation, had a substantial influence on the clustering results, especially for KM. Moreover, transformation also influenced clustering stability, with scaling tending to provide a stable solution at the same number of clusters. The APN values indicated improved stability with increasing data size; the effect generally leveled off above 70 samples, and most quickly for EM. We conclude that the best clustering method depends on the aim of the study and the number of clusters. In general, KM was relatively robust in our tests. We also provide recommendations for future applications of clustering analyses. This study helps ensure the credibility of the application and interpretation of clustering methods.
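The abstract names APN but does not define it. As a minimal sketch, assuming the definition commonly used for stability validation (for example in the R package clValid), APN compares the clustering obtained from the full data with the clusterings obtained after removing one variable at a time, and averages, over all samples and all removed variables, the proportion of each sample's original cluster that does not stay with it. The function name `apn` and the example labels below are hypothetical and are not taken from the study; the label lists could come from any of HC, KM, or EM.

```python
from collections import Counter


def apn(full_labels, reduced_labelings):
    """Average proportion of non-overlap (assumed clValid-style definition).

    full_labels: cluster label of each sample when all variables are used.
    reduced_labelings: one label list per removed variable, same sample order.
    Returns a value in [0, 1]; values near 0 indicate consistent clustering.
    """
    n = len(full_labels)
    if n == 0 or not reduced_labelings:
        raise ValueError("need at least one sample and one reduced labeling")
    full_sizes = Counter(full_labels)
    total = 0.0
    for reduced in reduced_labelings:
        if len(reduced) != n:
            raise ValueError("labelings must cover the same samples")
        # Number of samples shared by each (full cluster, reduced cluster) pair.
        overlap = Counter(zip(full_labels, reduced))
        for f, r in zip(full_labels, reduced):
            # Share of the sample's full-data cluster that left its new cluster.
            total += 1.0 - overlap[(f, r)] / full_sizes[f]
    return total / (n * len(reduced_labelings))


# Hypothetical labels for six samples (illustration only, not survey data).
full = [0, 0, 0, 1, 1, 1]
drop_var_1 = [1, 1, 1, 0, 0, 0]  # same partition, only the label names differ
drop_var_2 = [0, 0, 1, 1, 1, 1]  # one sample moves to the other cluster

print(apn(full, [drop_var_1]))                        # 0.0
print(round(apn(full, [drop_var_1, drop_var_2]), 4))  # 0.1111
```

Because the measure depends only on which samples share a cluster, relabeling clusters does not change it, which is why the first call returns 0.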
Acknowledgements
We thank members of the Fishery Ecosystem Monitoring and Assessment Laboratory of Ocean University of China for sample collection and treatments. Funding for this study was provided by the Marine S&T Fund of Shandong Province for Pilot National Laboratory for Marine Science and Technology (Qingdao) (No. 2018SDKJ0501-2).
Cite this article
Wo, J., Zhang, C., Xu, B. et al. Performances of Clustering Methods Considering Data Transformation and Sample Size: An Evaluation with Fisheries Survey Data. J. Ocean Univ. China 19, 659–668 (2020). https://doi.org/10.1007/s11802-020-4200-3
DOI: https://doi.org/10.1007/s11802-020-4200-3