Abstract
In data mining, clustering performance is largely affected by the number of samples, yet in some applications obtaining enough samples is difficult and expensive. To address this, data augmentation techniques such as oversampling have been adopted, but these methods focus mainly on local information in the data without considering its underlying distribution. In this paper, a new data augmentation method is proposed: a Wasserstein Generative Adversarial Network based on the Gaussian Mixture Model (GMM_WGAN), which generates data for small-sample datasets to overcome insufficient dataset size in clustering. The method has two steps: first, a Gaussian Mixture Model captures the underlying distribution of the real dataset; second, a Wasserstein generative adversarial network generates samples to expand the small dataset. We evaluate GMM_WGAN with five clustering algorithms and compare it against seven other data augmentation methods. Experiments on 10 small datasets demonstrate that the proposed approach outperforms the others on five evaluation metrics.
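The first step of the pipeline described above, modeling the real data with a Gaussian mixture and drawing samples from it, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the helper name `sample_gmm` is hypothetical, and in the paper the mixture parameters would be fitted to the real dataset (e.g. by EM, with the number of components selected by a criterion such as AIC) rather than supplied by hand.

```python
import numpy as np

def sample_gmm(weights, means, covs, n, rng=None):
    """Draw n samples from a Gaussian mixture.

    weights: component mixing proportions (must sum to 1)
    means:   list of component mean vectors
    covs:    list of component covariance matrices
    """
    rng = np.random.default_rng(rng)
    # Pick a mixture component for each sample, then draw from
    # that component's multivariate normal distribution.
    comps = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[k], covs[k])
                     for k in comps])

# Example: a two-component mixture in 2-D.
samples = sample_gmm(
    weights=[0.5, 0.5],
    means=[np.zeros(2), 5.0 * np.ones(2)],
    covs=[np.eye(2), np.eye(2)],
    n=200,
    rng=0,
)
```

In the second step, such samples (or the fitted distribution itself) would seed a WGAN whose critic is trained under the Wasserstein loss to refine the generated data toward the real distribution.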
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 61872297).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yu, H., Wang, Q.F. & Shi, J.Y. Data Augmentation Generated by Generative Adversarial Network for Small Sample Datasets Clustering. Neural Process Lett 55, 8365–8384 (2023). https://doi.org/10.1007/s11063-023-11315-z