
Training data augmentation using generative models with statistical guarantees for materials informatics


Abstract

In materials science, the small data size problem arises frequently when machine learning algorithms are applied, primarily because real experimental measurements are costly and simulations based on materials models, e.g., density functional theory, are time-consuming. We address the small data issue by generating training data with generative models. The proposed training data augmentation method generates data using kernel density estimation models while maintaining statistical guarantees via unbiased estimators of linear regression; it has low computational cost and simple hyper-parameter tuning compared to deep neural networks. In addition, we derive an upper bound on the \(L_{1}\) loss of the proposed method with respect to the true probability density function. Experiments were conducted on four benchmark datasets and three materials datasets (i.e., binary compounds, oxide ionic conductivity, and phosphorescent materials). For the generated data, we examined training and generalization performance using kernel ridge regression and compared the results with those of generative adversarial networks and real-NVPs. For the materials datasets, we also analyzed factors that influence the material properties (conductivity and emission intensity). The experimental results demonstrate that the proposed method can generate data that can be used as new training data.


Notes

  1. The hyper-parameter values of each model were determined by the best generalization performance on a hold-out test set, owing to the small data size.

  2. To obtain ten datasets for each round, a maximum of 500 distinct random seeds was needed for each dataset.

  3. T-values were calculated as \( |x_{1} - x_{2}|/\sqrt{(\sigma _{1}^{2}+\sigma _{2}^{2})/\mathrm{n}} \), where \( x_{1,2} \) are the average values, \( \sigma _{1,2}^{2} \) are the unbiased variances of \( x_{1,2} \), and n is the number of data points; in the experiments, n was ten. A small computation sketch is given after these notes.

  4. For example, for the GAN, https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html, and for the real-NVP, https://github.com/senya-ashukha/real-nvp-pytorch.

  5. For calculating p-values of the regression coefficients, we referred to the Web site "https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression".
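
For reference, the computations in footnotes 3 and 5 can be sketched in a few lines of Python. The sample data below are hypothetical, and the p-value function is a minimal sketch that omits the intercept term for brevity:

import numpy as np
from scipy import stats

def t_value(a, b):
    # Footnote 3: |mean1 - mean2| / sqrt((var1 + var2) / n), with unbiased variances.
    n = len(a)
    return abs(a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / n)

def ols_p_values(X, y):
    # Footnote 5 (sketch): two-sided p-values of OLS coefficients, no intercept term.
    n, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - d)                        # unbiased noise variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))  # coefficient standard errors
    tvals = beta / se
    return 2 * (1 - stats.t.cdf(np.abs(tvals), df=n - d))

# Hypothetical example with ten scores per method (n = 10, as in the experiments).
rng = np.random.default_rng(0)
print(t_value(rng.normal(0.80, 0.02, 10), rng.normal(0.77, 0.02, 10)))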



Acknowledgements

The author would like to thank the anonymous reviewer for their valuable comments and suggestions on the manuscript, Dr. Nobuko Ohba for preparing the ionic conductivity dataset and for helpful discussion, and Dr. Hirofumi Hazama for providing the phosphorescent materials dataset.

Author information


Corresponding author

Correspondence to Hiroshi Ohno.

Ethics declarations

Conflict of interest

The author declares that there are no conflicts of interest regarding the publication of this article.

Human and animal rights

This article does not contain any studies with human participants or animals performed by the author.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proof of unbiased estimator of ordinary least squares

Assume the true model is \( {\varvec{Y}} = {\varvec{X}}\beta + \epsilon \), where \( \epsilon \sim {\mathcal {N}}(0, \sigma ^{2}I) \), \( {\varvec{X}} \in {\mathbb {R}}^{\mathrm{n} \times \mathrm{d}} \), \( {\varvec{Y}} \in {\mathbb {R}}^{\mathrm{n}} \), and \( \beta \in {\mathbb {R}}^{\mathrm{d}} \); here, n denotes the data size, and d denotes the number of inputs. Then, we obtain the following:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}_{\epsilon } [{\hat{\beta }}]&= {\mathbb {E}}_{\epsilon } [({\varvec{X}}^{{\mathsf {T}}}{\varvec{X}})^{-1}{\varvec{X}}^{{\mathsf {T}}}{\varvec{Y}}]\\&= ({\varvec{X}}^{{\mathsf {T}}}{\varvec{X}})^{-1}{\varvec{X}}^{{\mathsf {T}}} {\mathbb {E}}_{\epsilon } [{\varvec{Y}}]\\&= ({\varvec{X}}^{{\mathsf {T}}}{\varvec{X}})^{-1}{\varvec{X}}^{{\mathsf {T}}} {\mathbb {E}}_{\epsilon } [{\varvec{X}}\beta + \epsilon ]\\&= ({\varvec{X}}^{{\mathsf {T}}}{\varvec{X}})^{-1}{\varvec{X}}^{{\mathsf {T}}}{\varvec{X}} \beta \; \; \;(\mathrm{by\;the\;assumption\;} {\mathbb {E}}_{\epsilon } [\epsilon ] = 0)\\&= \beta , \end{aligned} \end{aligned}$$
(6)

where superscript \( {\mathsf {T}} \) denotes the transpose of the matrix.
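
Unbiasedness can also be illustrated numerically. The following sketch, which is not part of the original proof, averages the OLS estimate over many noise draws for a fixed design matrix; the dimensions and coefficient values are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.5           # hypothetical data size, input count, noise level
X = rng.normal(size=(n, d))        # fixed design matrix
beta = np.array([1.0, -2.0, 0.5])  # true coefficients

# Average the OLS estimate over many independent noise realizations.
trials = 5000
beta_hat_sum = np.zeros(d)
for _ in range(trials):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat_sum += np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_hat_sum / trials)  # approaches beta, i.e., E[beta_hat] = beta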

B Proof of Equation (5)

The \( L_{1} \) loss between \( {\hat{p}}_{\mathrm{n}} (x) \) and p(x) is given as follows:

$$\begin{aligned} f(x_{1}, \ldots , x_{\mathrm{n}}) = \int _{-\infty }^{\infty } | {\hat{p}}_{\mathrm{n}}(x) - p(x) | \, dx. \end{aligned}$$
(7)

Moreover, for any pair of n-tuples \( (x_{1}, \ldots , x_{\mathrm{n}}) \) and \( (x_{1}, \ldots , x'_{i}, \ldots , x_{\mathrm{n}}) \) that differ only in the i-th coordinate, i.e., \( x_{i} \ne x'_{i} \), we can write

$$\begin{aligned}&| f(x_{1}, \ldots , x_{\mathrm{n}}) - f(x_{1}, \ldots , x'_{i}, \ldots , x_{\mathrm{n}}) |\nonumber \\&\quad = \left| \int _{-\infty }^{\infty } \left| \frac{1}{\mathrm{nh}} \sum _{j=1}^{\mathrm{n}} K\left( \frac{x-x_{j}}{\mathrm{h}}\right) - p(x) \right| \, dx \right. \nonumber \\&\qquad \left. - \int _{-\infty }^{\infty } \left| \frac{1}{\mathrm{nh}} \left( \sum _{j \ne i} K\left( \frac{x-x_{j}}{\mathrm{h}}\right) + K\left( \frac{x-x'_{i}}{\mathrm{h}}\right) \right) - p(x) \right| \, dx \right| \nonumber \\&\quad \le \int _{-\infty }^{\infty } \left| \frac{1}{\mathrm{nh}} \left( K\left( \frac{x-x_{i}}{\mathrm{h}}\right) - K\left( \frac{x-x'_{i}}{\mathrm{h}}\right) \right) \right| \, dx \nonumber \\&\quad = \frac{1}{\mathrm{nh}} \int _{-\infty }^{\infty } \left| K\left( \frac{x-x_{i}}{\mathrm{h}}\right) - K\left( \frac{x-x'_{i}}{\mathrm{h}}\right) \right| \, dx\nonumber \\&\quad \le \frac{1}{\mathrm{nh}} \left( \int _{-\infty }^{\infty } K\left( \frac{x-x_{i}}{\mathrm{h}}\right) \,dx\right. \nonumber \\&\qquad \left. + \int _{-\infty }^{\infty } K\left( \frac{x-x'_{i}}{\mathrm{h}}\right) \, dx \right) . \end{aligned}$$
(8)

where the first inequality follows from the reverse triangle inequality. Since \( \int _{-\infty }^{\infty } K(x) \, dx = 1 \), each kernel integral on the right-hand side equals \( \mathrm{h} \) (by the substitution \( u = (x - x_{i})/\mathrm{h} \)), and we obtain

$$\begin{aligned} \begin{aligned}&\left| f(x_{1}, \ldots , x_{\mathrm{n}}) - f(x_{1}, \ldots , x'_{i}, \ldots , x_{\mathrm{n}}) \right| \\&\le \frac{1}{\mathrm{nh}} \left( \int _{-\infty }^{\infty } K\left( \frac{x-x_{i}}{\mathrm{h}}\right) \,dx + \int _{-\infty }^{\infty } K\left( \frac{x-x'_{i}}{\mathrm{h}}\right) \, dx \right) \\&= \frac{1}{\mathrm{nh}}( 2\mathrm{h} ) = \frac{2}{\mathrm{n}}. \end{aligned} \end{aligned}$$
(9)

Using McDiarmid’s inequality, for all \( \delta > 0 \),

$$\begin{aligned} \begin{aligned} P\left[ f - {\mathbb {E}}[f] \ge \delta \right]&\le \exp {\left( -\frac{2 \delta ^{2}}{\sum _{i=1}^{\mathrm{n}}(\frac{2}{\mathrm{n}})^{2}} \right) }\\&= \exp {\left( -\frac{\mathrm{n} \delta ^{2}}{2} \right) }. \end{aligned} \end{aligned}$$
(10)

Here, setting \( \eta = \exp ( -\mathrm{n} \delta ^{2}/2 ) \), i.e., \( \delta = \frac{1}{\sqrt{\mathrm{n}}}\sqrt{ 2 \ln \frac{1}{\eta } } \), it holds with probability at least \( 1 - \eta \) that \( || {\hat{p}}_{\mathrm{n}} - p ||_{1} < {\mathbb {E}}_{x \sim p} [ || {\hat{p}}_{\mathrm{n}} - p ||_{1} ] + \delta \). Hence, for all n,

$$\begin{aligned} \begin{aligned} || {\hat{p}}_{\mathrm{n}} - p ||_{1}&< {\mathbb {E}}_{x \sim p} [ || {\hat{p}}_{\mathrm{n}} - p ||_{1} ] + \delta \\&= {\mathbb {E}}_{x \sim p} [ || {\hat{p}}_{\mathrm{n}} - p ||_{1} ] + \frac{1}{\sqrt{\mathrm{n}}}\sqrt{ 2 \ln \frac{1}{\eta } }. \end{aligned} \end{aligned}$$
(11)

Thus, Eq. 5 follows from Eq. 11.
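
For intuition, the confidence term \( \delta = \sqrt{2 \ln (1/\eta )}/\sqrt{\mathrm{n}} \) in Eq. (11) shrinks as \( O(1/\sqrt{\mathrm{n}}) \). A minimal numerical illustration, with hypothetical values of n and \( \eta \):

import math

def deviation_term(n, eta):
    # delta = sqrt(2 * ln(1/eta)) / sqrt(n), the confidence term in Eq. (11).
    return math.sqrt(2.0 * math.log(1.0 / eta)) / math.sqrt(n)

for n in (10, 100, 1000):
    print(n, deviation_term(n, eta=0.05))  # e.g., n = 100 gives about 0.245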

C Attributes of input variables for datasets

Table 12 Attributes of input variables for benchmark datasets
Table 13 Attributes of input variables for materials datasets (for LUMI, e.g., MgO indicates film thickness of MgO in nm)

D Implementation note

The software used in this study was implemented in Python3, PyTorch, and Scikit-learn on a Linux machine. The KDE implementation in Scikit-learn was used, and the GAN and real-NVP were constructed by referring to the Web sites.Footnote 4 In addition, the main part of the TDA algorithm (lines 8–28 in Algorithm 1) is described as follows:Footnote 5

(Code figure b: main part of the TDA algorithm)
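
As a rough illustration only, not the author's exact Algorithm 1, a KDE-based augmentation step with an OLS-consistency check (motivated by the unbiasedness argument of Appendix A) might be written with Scikit-learn as follows; the function name augment, the bandwidth grid, and the tolerance tol are hypothetical:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def augment(X, y, n_new, tol=0.3, seed=0):
    # Fit a Gaussian KDE to the joint (input, output) data; bandwidth chosen by CV.
    data = np.hstack([X, y.reshape(-1, 1)])
    kde = GridSearchCV(KernelDensity(kernel="gaussian"),
                       {"bandwidth": np.logspace(-2, 0, 10)}, cv=5).fit(data)
    samples = kde.best_estimator_.sample(n_new, random_state=seed)
    X_new, y_new = samples[:, :-1], samples[:, -1]
    # Consistency check: OLS coefficients on real vs. real-plus-generated data
    # should agree within a (hypothetical) relative tolerance.
    beta_real = LinearRegression().fit(X, y).coef_
    beta_aug = LinearRegression().fit(np.vstack([X, X_new]),
                                      np.hstack([y, y_new])).coef_
    if np.linalg.norm(beta_real - beta_aug) > tol * (np.linalg.norm(beta_real) + 1e-12):
        raise ValueError("generated data shifted the regression estimate too much")
    return X_new, y_new

# Hypothetical usage with a small synthetic regression dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = X_demo @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(scale=0.1, size=30)
X_new, y_new = augment(X_demo, y_demo, n_new=100)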


About this article


Cite this article

Ohno, H. Training data augmentation using generative models with statistical guarantees for materials informatics. Soft Comput 26, 1181–1196 (2022). https://doi.org/10.1007/s00500-021-06533-3

