
Using virtual samples to improve learning performance for small datasets with multimodal distributions

  • Der-Chiang Li
  • Liang-Sian Lin (corresponding author)
  • Chien-Chih Chen
  • Wei-Hao Yu
Methodologies and Application

Abstract

A small dataset, one containing very few samples (at most thirty, by the convention of traditional normal-distribution statistics), often makes it difficult for learning algorithms to make precise predictions. Many virtual sample generation (VSG) approaches in past studies have been shown to overcome this issue effectively by adding virtual samples to training sets. Some of these methods, however, create samples from estimated sample distributions and simply treat the distributions as unimodal, without considering that small data may in fact follow multimodal distributions. Accordingly, before estimating sample distributions, this paper employs density-based spatial clustering of applications with noise (DBSCAN) to cluster small data and applies the AICc (the corrected Akaike information criterion for small samples) to assess the clustering results, as an essential step in data pre-processing. When the AICc indicates that the clusters appropriately represent the dispersion of the small dataset, the sample distribution of each cluster is estimated with the maximal p value (MPV) method to yield a multimodal distribution; otherwise, the data are treated as following a unimodal distribution. We call the proposed method multimodal MPV (MMPV). Based on the estimated distributions, virtual samples are created with a mechanism that determines suitable sample sizes. In the experiments, one real and two public datasets are examined, and bagging (bootstrap aggregating) is employed to build the models, which are support vector regressions with linear, polynomial, and radial basis kernel functions. According to most of the paired t test results, the forecasting accuracy of MMPV is significantly better than that of MPV, of a VSG method based on fuzzy C-means, and of REAL (training on the original sets only).
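
To make the pipeline concrete, here is a minimal illustrative sketch in Python (assuming scikit-learn and NumPy). It only follows the shape of the method described above: the per-cluster normal fit is a stand-in for the paper's MPV distribution estimation, and the toy data, the DBSCAN parameters (eps, min_samples), and the count of twenty virtual samples per cluster are hypothetical choices, not values from the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# A toy small dataset (n < 30): three features and a target, stacked jointly
# so that each virtual sample carries both inputs and an output.
X = rng.normal(size=(25, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=25)
D = np.column_stack([X, y])

# Step 1: look for multimodality by clustering (DBSCAN labels noise as -1).
labels = DBSCAN(eps=2.0, min_samples=3).fit_predict(D)
clusters = [D[labels == c] for c in sorted(set(labels)) if c != -1]
if not clusters:          # no cluster structure found: treat data as unimodal
    clusters = [D]

# Step 2: generate virtual samples per cluster. A per-feature normal fit
# stands in for the paper's MPV distribution estimation.
def virtual_samples(cluster, n_virtual):
    mu = cluster.mean(axis=0)
    sd = cluster.std(axis=0, ddof=1) + 1e-9   # guard against zero spread
    return rng.normal(mu, sd, size=(n_virtual, cluster.shape[1]))

D_aug = np.vstack([D] + [virtual_samples(c, 20) for c in clusters])
X_aug, y_aug = D_aug[:, :-1], D_aug[:, -1]

# Step 3: bagged support vector regression on the augmented training set.
model = BaggingRegressor(SVR(kernel="rbf"), n_estimators=10, random_state=0)
model.fit(X_aug, y_aug)
print("R^2 on the original small set:", round(model.score(X, y), 3))
```

The paper's AICc check is omitted from the sketch; it would compare the clustered and unclustered fits via AICc = −2 ln L + 2k + 2k(k + 1)/(n − k − 1), that is, the ordinary AIC plus a penalty that grows as the sample size n shrinks relative to the parameter count k, before committing to the multimodal estimate.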

Keywords

Small data · Multimodal distributions · Virtual sample · Clustering sizes

Notes

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.


Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  • Der-Chiang Li (1)
  • Liang-Sian Lin (2), corresponding author
  • Chien-Chih Chen (1)
  • Wei-Hao Yu (1)

  1. Department of Industrial and Information Management, National Cheng Kung University, Tainan, Taiwan, ROC
  2. Information and Communications Research Laboratories, Industrial Technology Research Institute, Chutung, Hsinchu, Taiwan, ROC
