Performances of Clustering Methods Considering Data Transformation and Sample Size: An Evaluation with Fisheries Survey Data

Journal of Ocean University of China

Abstract

Clustering comprises a group of unsupervised statistical techniques commonly used in many disciplines. When these techniques are applied to fish abundance data, many technical details must be considered to ensure reasonable interpretation. However, the reliability and stability of clustering methods have rarely been studied in the context of fisheries. This study presents an intensive evaluation of three common clustering methods, hierarchical clustering (HC), K-means (KM), and expectation-maximization (EM), based on fish community surveys in the coastal waters of Shandong, China. We evaluated the performance of these three methods under different numbers of clusters, data sizes, and data transformation approaches, focusing on consistency validation using the average proportion of non-overlap (APN) index. The results indicate that the three methods tend to disagree on the optimal number of clusters. EM performed relatively better at avoiding unbalanced classification, whereas HC and KM provided more stable clustering results. Data transformations, including scaling, square-root, and log transformation, had substantial influences on the clustering results, especially for KM. Transformation also influenced clustering stability, with scaling tending to provide a stable solution at the same number of clusters. The APN values indicated improved stability with increasing data size; the effect generally leveled off beyond 70 samples, and most quickly for EM. We conclude that the best clustering method should be chosen according to the aim of the study and the number of clusters. In general, KM was relatively robust in our tests. We also provide recommendations for future applications of clustering analyses. This study helps ensure the credibility of the application and interpretation of clustering methods.
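
As an illustration of the stability check described above, the following is a minimal sketch of how APN can be computed for K-means. It assumes the common leave-one-column-out definition of APN (the one used, for example, in the clValid R package): each variable is removed in turn, the data are re-clustered, and for every sample the fraction of its original cluster-mates that no longer share its cluster is averaged. Values near 0 indicate stable clustering. The function names (`kmeans`, `apn`, `scale`), the farthest-first seeding, the `log(x + 1)` form of the log transformation, the z-score form of scaling, and the toy `survey` matrix are all assumptions made for this example, not details taken from the study.

```python
import math

def kmeans(rows, k, iters=100):
    """Plain Lloyd's K-means with deterministic farthest-first seeding (an assumption)."""
    centers = [rows[0]]
    while len(centers) < k:
        # Next seed: the row farthest from its nearest existing center.
        centers.append(max(rows, key=lambda r: min(math.dist(r, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: math.dist(r, centers[j])) for r in rows]
        if new == labels:
            break  # assignments stopped changing
        labels = new
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def apn(rows, k):
    """Average proportion of non-overlap under leave-one-column-out re-clustering."""
    n, m = len(rows), len(rows[0])
    full = kmeans(rows, k)
    total = 0.0
    for col in range(m):
        reduced = [r[:col] + r[col + 1:] for r in rows]
        part = kmeans(reduced, k)
        for i in range(n):
            same_full = {j for j in range(n) if full[j] == full[i]}
            same_part = {j for j in range(n) if part[j] == part[i]}
            total += 1.0 - len(same_full & same_part) / len(same_full)
    return total / (n * m)

def scale(rows):
    """Column-wise z-score scaling (population standard deviation)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - mu) ** 2 for x in c) / len(c)) or 1.0
           for c, mu in zip(cols, means)]
    return [[(x - mu) / sd for x, mu, sd in zip(r, means, sds)] for r in rows]

# Hypothetical abundance matrix: rows are stations, columns are species.
survey = [
    [1, 2, 1], [2, 1, 2], [1, 1, 2],
    [50, 60, 55], [52, 61, 54], [51, 59, 56],
]

transforms = {
    "raw": survey,
    "sqrt": [[math.sqrt(x) for x in r] for r in survey],
    "log": [[math.log(x + 1) for x in r] for r in survey],
    "scale": scale(survey),
}
for name, data in transforms.items():
    print(name, apn(data, k=2))
```

Repeating the same loop over subsamples of increasing size would give a rough picture of how stability changes with the number of samples, which is the kind of comparison the abstract reports.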

Acknowledgements

We thank members of the Fishery Ecosystem Monitoring and Assessment Laboratory of Ocean University of China for sample collection and processing. Funding for this study was provided by the Marine S&T Fund of Shandong Province for Pilot National Laboratory for Marine Science and Technology (Qingdao) (No. 2018SDKJ0501-2).

Author information

Corresponding author

Correspondence to Yiping Ren.

About this article

Cite this article

Wo, J., Zhang, C., Xu, B. et al. Performances of Clustering Methods Considering Data Transformation and Sample Size: An Evaluation with Fisheries Survey Data. J. Ocean Univ. China 19, 659–668 (2020). https://doi.org/10.1007/s11802-020-4200-3

  • DOI: https://doi.org/10.1007/s11802-020-4200-3
