
Training data augmentation using generative models with statistical guarantees for materials informatics


Abstract

In materials science, the small data size problem arises frequently when machine learning algorithms are applied, primarily because real experimental measurements are costly and simulations based on materials models, e.g., density functional theory, are time-consuming. We address the small data issue by generating training data with generative models. The proposed training data augmentation method generates data using kernel density estimation models while maintaining statistical guarantees via unbiased estimators of linear regression; it has low computational cost and simple hyper-parameter tuning compared to deep neural networks. In addition, we derive an upper bound on the \(L_{1}\) loss of the proposed method with respect to the true probability density function. Experiments were conducted on four benchmark datasets and three materials datasets (i.e., binary compounds, oxide ionic conductivity, and phosphorescent materials). For the generated data, we examined training and generalization performance using kernel ridge regression and compared the results with those of generative adversarial networks and real-NVPs. For the materials datasets, we also analyzed factors that influence the material properties (conductivity and emission intensity). The experimental results demonstrate that the proposed method can generate data that can be used as new training data.


Notes

  1. The hyper-parameter values of each model were determined by the best generalization performance on a hold-out test set, owing to the small data size.

  2. To obtain ten datasets for each round, a maximum of 500 distinct random seeds was needed for each dataset.

  3. T-values were calculated as \( |x_{1} - x_{2}|/\sqrt{(\sigma _{1}^{2}+\sigma _{2}^{2})/\mathrm{n}} \), where \( x_{1,2} \) are the average values, \( \sigma _{1,2}^{2} \) are the unbiased variances of \( x_{1,2} \), and n is the number of data points; in the experiments, n was ten. A small computation sketch is given after these notes.

  4. For example, for the GAN, https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html, and for the real-NVP, https://github.com/senya-ashukha/real-nvp-pytorch.

  5. For calculating p-values of the regression coefficients, we referred to the Web site "https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression".
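
For reference, the computations in footnotes 3 and 5 can be sketched in a few lines of Python. The sample data below are hypothetical, and the p-value function is a minimal sketch that omits the intercept term for brevity:

import numpy as np
from scipy import stats

def t_value(a, b):
    # Footnote 3: |mean1 - mean2| / sqrt((var1 + var2) / n), with unbiased variances.
    n = len(a)
    return abs(a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / n)

def ols_p_values(X, y):
    # Footnote 5 (sketch): two-sided p-values of OLS coefficients, no intercept term.
    n, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - d)                        # unbiased noise variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))  # coefficient standard errors
    tvals = beta / se
    return 2 * (1 - stats.t.cdf(np.abs(tvals), df=n - d))

# Hypothetical example with ten scores per method (n = 10, as in the experiments).
rng = np.random.default_rng(0)
print(t_value(rng.normal(0.80, 0.02, 10), rng.normal(0.77, 0.02, 10)))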



Acknowledgements

The author would like to thank the anonymous reviewer for their valuable comments and suggestions on the manuscript, Dr. Nobuko Ohba for preparing the ionic conductivity dataset and for helpful discussion, and Dr. Hirofumi Hazama for providing the phosphorescent materials dataset.

Author information


Corresponding author

Correspondence to Hiroshi Ohno.

Ethics declarations

Conflict of interest

The author declares that there are no conflicts of interest regarding the publication of this article.

Human and animal rights

This article does not contain any studies with human participants or animals performed by the author.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proof of unbiased estimator of ordinary least squares

Assume the true model is \( {\varvec{Y}} = {\varvec{X}}\beta + \epsilon \), where \( \epsilon \sim {\mathcal {N}}(0, \sigma ^{2}I) \), \( {\varvec{X}} \in {\mathbb {R}}^{\mathrm{n} \times \mathrm{d}} \), \( {\varvec{Y}} \in {\mathbb {R}}^{\mathrm{n}} \), and \( \beta \in {\mathbb {R}}^{\mathrm{d}} \); here, n denotes the data size, and d denotes the number of inputs. Then, we obtain the following:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}_{\epsilon } [{\hat{\beta }}]&= {\mathbb {E}}_{\epsilon } [({\varvec{X}}^{{\mathsf {T}}}{\varvec{X}})^{-1}{\varvec{X}}^{{\mathsf {T}}}{\varvec{Y}}]\\&= ({\varvec{X}}^{{\mathsf {T}}}{\varvec{X}})^{-1}{\varvec{X}}^{{\mathsf {T}}} {\mathbb {E}}_{\epsilon } [{\varvec{Y}}]\\&= ({\varvec{X}}^{{\mathsf {T}}}{\varvec{X}})^{-1}{\varvec{X}}^{{\mathsf {T}}} {\mathbb {E}}_{\epsilon } [{\varvec{X}}\beta + \epsilon ]\\&= ({\varvec{X}}^{{\mathsf {T}}}{\varvec{X}})^{-1}{\varvec{X}}^{{\mathsf {T}}}{\varvec{X}} \beta \; \; \;(\mathrm{by\;the\;assumption\;} {\mathbb {E}}_{\epsilon } [\epsilon ] = 0)\\&= \beta , \end{aligned} \end{aligned}$$
(6)

where superscript \( {\mathsf {T}} \) denotes the transpose of the matrix.
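
Unbiasedness can also be illustrated numerically. The following sketch, which is not part of the original proof, averages the OLS estimate over many noise draws for a fixed design matrix; the dimensions and coefficient values are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.5           # hypothetical data size, input count, noise level
X = rng.normal(size=(n, d))        # fixed design matrix
beta = np.array([1.0, -2.0, 0.5])  # true coefficients

# Average the OLS estimate over many independent noise realizations.
trials = 5000
beta_hat_sum = np.zeros(d)
for _ in range(trials):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat_sum += np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_hat_sum / trials)  # approaches beta, i.e., E[beta_hat] = beta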

B Proof of Equation (5)

The \( L_{1} \) loss between \( {\hat{p}}_{\mathrm{n}} (x) \) and p(x) is given as follows:

$$\begin{aligned} f(x_{1}, \ldots , x_{\mathrm{n}}) = \int _{-\infty }^{\infty } | {\hat{p}}_{\mathrm{n}}(x) - p(x) | \, dx. \end{aligned}$$
(7)

Moreover, for any pair of n-tuples \( (x_{1}, \ldots , x_{\mathrm{n}}) \) and \( (x_{1}, \ldots , x'_{i}, \ldots , x_{\mathrm{n}}) \) that differ only in the i-th coordinate, i.e., \( x_{i} \ne x'_{i} \), we can write

$$\begin{aligned}&| f(x_{1}, \ldots , x_{\mathrm{n}}) - f(x_{1}, \ldots , x'_{i}, \ldots , x_{\mathrm{n}}) |\nonumber \\&\quad = \left| \int _{-\infty }^{\infty } \left| \frac{1}{\mathrm{nh}} \sum _{j=1}^{\mathrm{n}} K\left( \frac{x-x_{j}}{\mathrm{h}}\right) - p(x) \right| \, dx \right. \nonumber \\&\qquad \left. - \int _{-\infty }^{\infty } \left| \frac{1}{\mathrm{nh}} \left( \sum _{j \ne i} K\left( \frac{x-x_{j}}{\mathrm{h}}\right) + K\left( \frac{x-x'_{i}}{\mathrm{h}}\right) \right) - p(x) \right| \, dx \right| \nonumber \\&\quad \le \int _{-\infty }^{\infty } \left| \frac{1}{\mathrm{nh}} \left( K\left( \frac{x-x_{i}}{\mathrm{h}}\right) - K\left( \frac{x-x'_{i}}{\mathrm{h}}\right) \right) \right| \, dx \nonumber \\&\quad = \frac{1}{\mathrm{nh}} \int _{-\infty }^{\infty } \left| K\left( \frac{x-x_{i}}{\mathrm{h}}\right) - K\left( \frac{x-x'_{i}}{\mathrm{h}}\right) \right| \, dx\nonumber \\&\quad \le \frac{1}{\mathrm{nh}} \left( \int _{-\infty }^{\infty } K\left( \frac{x-x_{i}}{\mathrm{h}}\right) \,dx\right. \nonumber \\&\qquad \left. + \int _{-\infty }^{\infty } K\left( \frac{x-x'_{i}}{\mathrm{h}}\right) \, dx \right) . \end{aligned}$$
(8)

where the first inequality follows from the reverse triangle inequality. Since \( \int _{-\infty }^{\infty } K(x) \, dx = 1 \), each kernel integral on the right-hand side equals \( \mathrm{h} \) (by the substitution \( u = (x - x_{i})/\mathrm{h} \)), and we obtain

$$\begin{aligned} \begin{aligned}&\left| f(x_{1}, \ldots , x_{\mathrm{n}}) - f(x_{1}, \ldots , x'_{i}, \ldots , x_{\mathrm{n}}) \right| \\&\le \frac{1}{\mathrm{nh}} \left( \int _{-\infty }^{\infty } K\left( \frac{x-x_{i}}{\mathrm{h}}\right) \,dx + \int _{-\infty }^{\infty } K\left( \frac{x-x'_{i}}{\mathrm{h}}\right) \, dx \right) \\&= \frac{1}{\mathrm{nh}}( 2\mathrm{h} ) = \frac{2}{\mathrm{n}}. \end{aligned} \end{aligned}$$
(9)

Using McDiarmid’s inequality, for all \( \delta > 0 \),

$$\begin{aligned} \begin{aligned} P\left[ f - {\mathbb {E}}[f] \ge \delta \right]&\le \exp {\left( -\frac{2 \delta ^{2}}{\sum _{i=1}^{\mathrm{n}}(\frac{2}{\mathrm{n}})^{2}} \right) }\\&= \exp {\left( -\frac{\mathrm{n} \delta ^{2}}{2} \right) }. \end{aligned} \end{aligned}$$
(10)

Here, setting \( \eta = \exp ( -\mathrm{n} \delta ^{2}/2 ) \), i.e., \( \delta = \frac{1}{\sqrt{\mathrm{n}}}\sqrt{ 2 \ln \frac{1}{\eta } } \), it holds with probability at least \( 1 - \eta \) that \( || {\hat{p}}_{\mathrm{n}} - p ||_{1} < {\mathbb {E}}_{x \sim p} [ || {\hat{p}}_{\mathrm{n}} - p ||_{1} ] + \delta \). Hence, for all n,

$$\begin{aligned} \begin{aligned} || {\hat{p}}_{\mathrm{n}} - p ||_{1}&< {\mathbb {E}}_{x \sim p} [ || {\hat{p}}_{\mathrm{n}} - p ||_{1} ] + \delta \\&= {\mathbb {E}}_{x \sim p} [ || {\hat{p}}_{\mathrm{n}} - p ||_{1} ] + \frac{1}{\sqrt{\mathrm{n}}}\sqrt{ 2 \ln \frac{1}{\eta } }. \end{aligned} \end{aligned}$$
(11)

Thus, Eq. 5 follows from Eq. 11.
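
For intuition, the confidence term \( \delta = \sqrt{2 \ln (1/\eta )}/\sqrt{\mathrm{n}} \) in Eq. (11) shrinks as \( O(1/\sqrt{\mathrm{n}}) \). A minimal numerical illustration, with hypothetical values of n and \( \eta \):

import math

def deviation_term(n, eta):
    # delta = sqrt(2 * ln(1/eta)) / sqrt(n), the confidence term in Eq. (11).
    return math.sqrt(2.0 * math.log(1.0 / eta)) / math.sqrt(n)

for n in (10, 100, 1000):
    print(n, deviation_term(n, eta=0.05))  # e.g., n = 100 gives about 0.245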

C Attributes of input variables for datasets

Table 12 Attributes of input variables for benchmark datasets
Table 13 Attributes of input variables for materials datasets (for LUMI, e.g., MgO indicates film thickness of MgO in nm)

D Implementation note

The software used in this study was implemented in Python3, PyTorch, and Scikit-learn on a Linux machine. The KDE implementation in Scikit-learn was used, and the GAN and real-NVP were constructed by referring to the Web sites.Footnote 4 In addition, the main part of the TDA algorithm (lines 8–28 in Algorithm 1) is described as follows:Footnote 5

(Code figure b: main part of the TDA algorithm)
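
As a rough illustration only, not the author's exact Algorithm 1, a KDE-based augmentation step with an OLS-consistency check (motivated by the unbiasedness argument of Appendix A) might be written with Scikit-learn as follows; the function name augment, the bandwidth grid, and the tolerance tol are hypothetical:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def augment(X, y, n_new, tol=0.3, seed=0):
    # Fit a Gaussian KDE to the joint (input, output) data; bandwidth chosen by CV.
    data = np.hstack([X, y.reshape(-1, 1)])
    kde = GridSearchCV(KernelDensity(kernel="gaussian"),
                       {"bandwidth": np.logspace(-2, 0, 10)}, cv=5).fit(data)
    samples = kde.best_estimator_.sample(n_new, random_state=seed)
    X_new, y_new = samples[:, :-1], samples[:, -1]
    # Consistency check: OLS coefficients on real vs. real-plus-generated data
    # should agree within a (hypothetical) relative tolerance.
    beta_real = LinearRegression().fit(X, y).coef_
    beta_aug = LinearRegression().fit(np.vstack([X, X_new]),
                                      np.hstack([y, y_new])).coef_
    if np.linalg.norm(beta_real - beta_aug) > tol * (np.linalg.norm(beta_real) + 1e-12):
        raise ValueError("generated data shifted the regression estimate too much")
    return X_new, y_new

# Hypothetical usage with a small synthetic regression dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = X_demo @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(scale=0.1, size=30)
X_new, y_new = augment(X_demo, y_demo, n_new=100)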


About this article


Cite this article

Ohno, H. Training data augmentation using generative models with statistical guarantees for materials informatics. Soft Comput 26, 1181–1196 (2022). https://doi.org/10.1007/s00500-021-06533-3

