Performances of Clustering Methods Considering Data Transformation and Sample Size: An Evaluation with Fisheries Survey Data

Wo, Jia; Zhang, Chongliang; Xu, Binduo; Xue, Ying; Ren, Yiping

doi:10.1007/s11802-020-4200-3

Published: 02 May 2020

Volume 19, pages 659–668, (2020)

Journal of Ocean University of China

Jia Wo¹, Chongliang Zhang¹, Binduo Xu¹, Ying Xue¹ & Yiping Ren¹,²

Abstract

Clustering comprises a group of unsupervised statistical techniques commonly used in many disciplines. When these techniques are applied to fish abundance data, many technical details need to be considered to ensure reasonable interpretation. However, the reliability and stability of clustering methods have rarely been studied in the context of fisheries. This study presents an intensive evaluation of three common clustering methods, hierarchical clustering (HC), K-means (KM), and expectation-maximization (EM), based on fish community surveys in the coastal waters of Shandong, China. We evaluated the performance of the three methods under different numbers of clusters, data sizes, and data transformation approaches, focusing on consistency validation using the average proportion of non-overlap (APN) index. The results indicate that the three methods tend to disagree on the optimal number of clusters. EM performed relatively better at avoiding unbalanced classification, whereas HC and KM provided more stable clustering results. Data transformations, including scaling, square-root, and log-transformation, had substantial influences on the clustering results, especially for KM. Transformation also influenced clustering stability, with scaling tending to provide a stable solution at the same number of clusters. The APN values indicated improved stability with increasing data size; the effect generally leveled off beyond 70 samples, and most quickly for EM. We conclude that the best clustering method depends on the aim of the study and the number of clusters. In general, KM was relatively robust in our tests. We also provide recommendations for future applications of clustering analyses. This study helps to ensure the credibility of the application and interpretation of clustering methods.
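
To make the APN index mentioned in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the leave-one-column-out definition commonly used for APN: cluster the full sample-by-species matrix, cluster it again with each column removed in turn, and average the proportion of each observation's original cluster-mates that are no longer grouped with it. The function names (`transform`, `kmeans`, `apn`), the `log(x + 1)` offset, and the deterministic farthest-first initialization of K-means are illustrative assumptions chosen to keep the example self-contained and reproducible.

```python
import math


def transform(rows, how):
    """Apply one of the transformations named in the abstract (hypothetical helper)."""
    if how == "sqrt":
        return [[math.sqrt(v) for v in r] for r in rows]
    if how == "log":
        # log(x + 1) is an assumption so that zero catches stay defined.
        return [[math.log(v + 1.0) for v in r] for r in rows]
    if how == "scale":
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # Sample standard deviation; constant columns fall back to 1.0.
        sds = [math.sqrt(sum((v - m) ** 2 for v in c) / max(len(c) - 1, 1)) or 1.0
               for c, m in zip(cols, means)]
        return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]
    return [list(r) for r in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iters=100):
    """Lloyd's K-means with farthest-first initialization (a simplification)."""
    centers = [list(rows[0])]
    while len(centers) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: sq_dist(r, centers[j])) for r in rows]
        if new == labels:  # assignments stopped changing
            break
        labels = new
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def apn(rows, k, cluster=kmeans):
    """Average proportion of non-overlap; 0 means perfectly consistent clusters."""
    rows = [list(r) for r in rows]
    n, m = len(rows), len(rows[0])
    full = cluster(rows, k)
    total = 0.0
    for col in range(m):
        reduced = [r[:col] + r[col + 1:] for r in rows]  # drop one column
        part = cluster(reduced, k)
        for i in range(n):
            mates_full = {j for j in range(n) if full[j] == full[i]}
            mates_part = {j for j in range(n) if part[j] == part[i]}
            total += 1.0 - len(mates_full & mates_part) / len(mates_full)
    return total / (n * m)


if __name__ == "__main__":
    # Made-up toy abundances (rows = samples, columns = species), not survey data.
    toy = [[0, 0, 0], [1, 0, 1], [0, 1, 0],
           [100, 100, 100], [101, 100, 101], [100, 101, 100]]
    for how in ("none", "sqrt", "log", "scale"):
        print(how, apn(transform(toy, how), k=2))
```

Any function with the same `(rows, k)` signature can be passed as `cluster`, so a hierarchical or mixture-model routine could be substituted to compare stability across methods in the spirit of the study.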



Acknowledgements

We thank members of the Fishery Ecosystem Monitoring and Assessment Laboratory of Ocean University of China for sample collection and treatments. Funding for this study was provided by the Marine S&T Fund of Shandong Province for Pilot National Laboratory for Marine Science and Technology (Qingdao) (No. 2018SDKJ0501-2).

Author information

Authors and Affiliations

College of Fisheries, Ocean University of China, Qingdao, 266003, China
Jia Wo, Chongliang Zhang, Binduo Xu, Ying Xue & Yiping Ren
Laboratory for Marine Science and Food Production Processes, National Laboratory for Marine Science and Technology, Qingdao, 266237, China
Yiping Ren

Corresponding author

Correspondence to Yiping Ren.

About this article

Cite this article

Wo, J., Zhang, C., Xu, B. et al. Performances of Clustering Methods Considering Data Transformation and Sample Size: An Evaluation with Fisheries Survey Data. J. Ocean Univ. China 19, 659–668 (2020). https://doi.org/10.1007/s11802-020-4200-3

Received: 12 April 2019
Revised: 04 August 2019
Accepted: 20 March 2020
Published: 02 May 2020
Issue Date: June 2020
DOI: https://doi.org/10.1007/s11802-020-4200-3
