Abstract
In materials science, the small data size problem occurs frequently when applying machine learning algorithms. This issue arises primarily from, for example, the cost of real experimental measurements and time-consuming simulations based on materials models, e.g., density functional theory. We address the small data issue by generating training data using generative models. The proposed training data augmentation method generates data using kernel density estimation models while maintaining statistical guarantees via unbiased estimators of linear regression, and it has low computational cost and easy hyper-parameter tuning compared to deep neural networks. In addition, we derive an upper bound on the \(L_{1}\) loss of the proposed method with respect to probability density functions. Experiments were conducted with four benchmark datasets and three materials datasets (i.e., binary compounds, oxide ionic conductivity, and phosphorescent materials). For the generated data, we examined training and generalization performance using kernel ridge regression, compared with generative adversarial networks and real-NVPs. For the materials datasets, we analyzed factors that influence the material properties (conductivity and emission intensity). The experimental results demonstrate that the proposed method can generate data that can be used as new training data.
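As an illustrative sketch only (not the paper's Algorithm 1; the stand-in sampler, names, and acceptance threshold below are all assumptions), the abstract's idea of pairing KDE-style data generation with an unbiased linear-regression check can be mimicked as follows:

```python
import numpy as np

# Illustrative sketch only (not the paper's Algorithm 1): accept KDE-style
# generated data when the unbiased OLS estimate fit on them stays close to
# the estimate fit on the real training data.
rng = np.random.default_rng(0)
n, d = 40, 2
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

def ols(X, y):
    # unbiased ordinary least squares estimator (X^T X)^{-1} X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

beta_real = ols(X, y)

# crude stand-in for KDE sampling: resample a point, add Gaussian kernel noise
h = 0.2
idx = rng.integers(0, n, size=200)
X_gen = X[idx] + h * rng.normal(size=(200, d))
y_gen = y[idx] + h * rng.normal(size=200)
beta_gen = ols(X_gen, y_gen)

# simple acceptance check on the generated data (threshold is an assumption)
ok = bool(np.linalg.norm(beta_gen - beta_real) < 0.5)
```

The smoothing bandwidth `h` slightly attenuates the regression coefficients, which is exactly the kind of deviation such a check would flag if the bandwidth were chosen too large.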
Notes
Because of the small data size, the hyper-parameter values of the model were determined by the best generalization performance on a hold-out test set.
To obtain ten datasets for each round, at most 500 distinct random seeds were needed for each dataset.
T-values were calculated by \( |x_{1} - x_{2}|/\sqrt{(\sigma _{1}^{2}+\sigma _{2}^{2})/\mathrm{n}} \), where \( x_{1} \) and \( x_{2} \) are the average values, \( \sigma _{1}^{2} \) and \( \sigma _{2}^{2} \) are their unbiased variances, and n is the number of data points. In the experiments, n was ten.
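The footnote's t-value can be computed directly; the helper below is an illustrative sketch (the function name and the equal-sample-size assumption are ours, not from the paper):

```python
import numpy as np

def t_value(a, b):
    """T-value |x1 - x2| / sqrt((s1^2 + s2^2) / n) as in the footnote,
    assuming both samples have the same size n."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    assert len(b) == n, "the footnote's formula uses equal sample sizes"
    s1, s2 = a.var(ddof=1), b.var(ddof=1)   # unbiased variances
    return abs(a.mean() - b.mean()) / np.sqrt((s1 + s2) / n)
```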
For example, for GAN, https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html, and for real-NVP, https://github.com/senya-ashukha/real-nvp-pytorch.
For calculating p-values of the regression coefficients, we referred to the Web site "https://stackoverflow.com/questions/27928275/find-p-value-significance-in-scikit-learn-linearregression".
References
Arjovsky M, Bottou L (2017) Towards principled methods for training generative adversarial networks. In: Proceedings of the 5th International Conference on Learning Representations
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Blöchl PE (1994) Projector augmented-wave method. Phys Rev B 50:17953–17979
Cubuk ED, Sendek AD, Reed EJ (2019) Screening billions of candidates for solid lithium-ion conductors: a transfer learning approach for small data. J Chem Phys 150(21):214701
Cui Z, Xue F, Cai X, Cao Y, Wang GG, Chen J (2018) Detection of malicious code variants based on deep learning. IEEE Trans Ind Inf 14(7):3187–3196. https://doi.org/10.1109/TII.2018.2822680
Danihelka I, Lakshminarayanan B, Uria B, Wierstra D, Dayan P (2017) Comparison of maximum likelihood and GAN-based training of real NVPs. CoRR abs/1705.05263
Dinh L, Sohl-Dickstein J, Bengio S (2017) Density estimation using real NVP. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, OpenReview.net
Gatys LA, Ecker AS, Bethge M (2015) A neural algorithm of artistic style. CoRR abs/1508.06576
Ghiringhelli LM, Vybiral J, Levchenko SV, Draxl C, Scheffler M (2015) Big data of materials science: critical role of the descriptor. Phys Rev Lett 114:105503
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems 27. Curran Associates Inc., New York, pp 2672–2680
Gurumurthy S, Sarvadevabhatla RK, Babu RV (2017) Deligan: Generative adversarial networks for diverse and limited data. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society, pp 4941–4949
Hazama H, Sobue S, Tajima S, Asahi R (2019) Phosphorescent material search using a combination of high-throughput evaluation and machine learning. Inorg Chem 58(16):10936–10943
He Y, Cubuk ED, Allendorf MD, Reed EJ (2018) Metallic metal-organic frameworks predicted by the combination of machine learning methods and ab initio calculations. J Phys Chem Lett 9(16):4562–4569
Heidari AA, Mirjalili S, Faris H, Aljarah I, Mafarja M, Chen H (2019) Harris hawks optimization: algorithm and applications. Future Gener Comput Syst 97:849–872. https://doi.org/10.1016/j.future.2019.02.028
Kajita S, Ohba N, Suzumura A, Tajima S, Asahi R (2020) Discovery of superionic conductors by ensemble-scope descriptor. NPG Asia Mater 12(1):31
Kresse G, Furthmüller J (1996) Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys Rev B 54:11169–11186
Li S, Chen H, Wang M, Heidari AA, Mirjalili S (2020) Slime mould algorithm: a new method for stochastic optimization. Future Gener Comput Syst 111:300–323. https://doi.org/10.1016/j.future.2020.03.055
Lukasik S (2007) Parallel computing of kernel density estimates with MPI. In: International Conference on Computational Science
Mariani G, Scheidegger F, Istrate R, Bekas C, Malossi ACI (2018) BAGAN: data augmentation with balancing GAN. CoRR abs/1803.09655
Matsubara M, Suzumura A, Ohba N, Asahi R (2020) Identifying superionic conductors by materials informatics and high-throughput synthesis. Commun Mater 1(1):5
Mirza M, Osindero S (2014) Conditional generative adversarial nets. CoRR abs/1411.1784
Ohno H (2019) Training data augmentation: an empirical study using generative adversarial net-based approach with normalizing flow models for materials informatics. Appl Soft Comput 86:105932
Ohno H (2020) Auto-encoder-based generative models for data augmentation on regression problems. Soft Comput 24(11):7999–8009
Onat B, Cubuk ED, Malone BD, Kaxiras E (2018) Implanted neural network potentials: application to li-si alloys. Phys Rev B 97:094106
Kingma DP, Welling M (2014) Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations
Raykar V, Duraiswami R, Zhao L (2010) Fast computation of kernel estimators. J Comput Graph Stat 19:205–220
Saad Y, Gao D, Ngo T, Bobbitt S, Chelikowsky JR, Andreoni W (2012) Data mining for materials: computational experiments with \({AB}\) compounds. Phys Rev B 85:104104
Sanna S, Esposito V, Christensen M, Pryds N (2016) High ionic conductivity in confined bismuth oxide-based heterostructures. APL Mater 4(12):121101
Sendek AD, Yang Q, Cubuk ED, Duerloo KAN, Cui Y, Reed EJ (2017) Holistic computational structure screening of more than 12000 candidates for solid lithium-ion conductor materials. Energy Environ Sci 10:306–320
Shorten C, Khoshgoftaar TM (2019) A survey on image data augmentation for deep learning. J Big Data 6(1):60
Vershynin R (2018) High-dimensional probability: an introduction with applications in data science. Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, Cambridge, pp 5–55
Wang G, Guo L, Duan H (2013) Wavelet neural network using multiple wavelet functions in target threat assessment. Sci World J 2013:632437. https://doi.org/10.1155/2013/632437
Wang G, Lu M, Dong YQ, Zhao X (2015a) Self-adaptive extreme learning machine. Neural Comput Appl 27:291–303
Wang GG (2018) Moth search algorithm: a bio-inspired metaheuristic algorithm for global optimization problems. Memet Comput 10(2):151–164. https://doi.org/10.1007/s12293-016-0212-3
Wang GG, Deb S, Coelho LdS (2015b) Elephant herding optimization. In: 2015 3rd International Symposium on Computational and Business Intelligence (ISCBI), pp 1–5. https://doi.org/10.1109/ISCBI.2015.8
Wang GG, Deb S, Coelho LdS (2018) Earthworm optimisation algorithm: a bio-inspired metaheuristic algorithm for global optimisation problems. Int J Bio-Inspired Comput 12(1):1–22
Wang GG, Deb S, Cui Z (2019) Monarch butterfly optimization. Neural Comput Appl 31(7):1995–2014. https://doi.org/10.1007/s00521-015-1923-y
Wu Y, Burda Y, Salakhutdinov R, Grosse RB (2016) On the quantitative analysis of decoder-based generative models. CoRR abs/1611.04273
Yang XS (2010) A new metaheuristic bat-inspired algorithm. In: Nature inspired cooperative strategies for optimization (NICSO 2010), Studies in Computational Intelligence. Springer, Berlin, pp 65–74
Yi JH, Wang J, Wang GG (2016) Improved probabilistic neural networks with self-adaptive strategies for transformer fault diagnosis problem. Adv Mech Eng 8(1):1–13. https://doi.org/10.1177/1687814015624832
Zhang S, Zhang S, Wang B, Habetler TG (2020) Deep learning algorithms for bearing fault diagnostics-a comprehensive review. IEEE Access 8:29857–29881
Zhang W, Li X, Jia XD, Ma H, Luo Z, Li X (2020) Machinery fault diagnosis with imbalanced data using deep generative adversarial networks. Measurement 152:107377
Zhang Y, Ling C (2018) A strategy to apply machine learning to small datasets in materials science. npj Comput Mater 4(1):25
Acknowledgements
The author would like to thank the anonymous reviewer for their valuable comments and suggestions on the manuscript and Dr. Nobuko Ohba for preparing the ionic conductivity dataset and discussion and Dr. Hirofumi Hazama for providing the phosphorescent materials dataset.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The author declares that there are no conflicts of interest regarding the publication of this article.
Human and animal rights
This article does not contain any studies with human participants or animals performed by the author.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Proof of unbiased estimator of ordinary least squares
Assume the true model is \( {\varvec{Y}} = {\varvec{X}}\beta + \epsilon \), where \( \epsilon \sim {\mathcal {N}}(0, \sigma ^{2}I) \), \( {\varvec{X}} \in {\mathbb {R}}^{\mathrm{n} \times \mathrm{d}} \), \( {\varvec{Y}} \in {\mathbb {R}}^{\mathrm{n}} \), \( \beta \in {\mathbb {R}}^{\mathrm{d}} \), n denotes the data size, and d denotes the number of inputs. Then, for the ordinary least squares estimator \( {\hat{\beta }} = ({\varvec{X}}^{\mathsf {T}}{\varvec{X}})^{-1}{\varvec{X}}^{\mathsf {T}}{\varvec{Y}} \), we obtain the following:

$$ {\mathbb {E}}[{\hat{\beta }}] = ({\varvec{X}}^{\mathsf {T}}{\varvec{X}})^{-1}{\varvec{X}}^{\mathsf {T}}\,{\mathbb {E}}[{\varvec{X}}\beta + \epsilon ] = ({\varvec{X}}^{\mathsf {T}}{\varvec{X}})^{-1}{\varvec{X}}^{\mathsf {T}}{\varvec{X}}\beta = \beta , $$

where superscript \( {\mathsf {T}} \) denotes the transpose of the matrix.
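The unbiasedness above can be checked numerically; the following Monte Carlo sketch (the dimensions, coefficients, and noise level are assumed toy values) averages the OLS estimate over repeated noise draws:

```python
import numpy as np

# Monte Carlo check of the unbiasedness above: averaging the OLS estimate
# (X^T X)^{-1} X^T Y over repeated noise draws should recover beta.
# (The dimensions, beta, and noise level below are assumed toy values.)
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
beta = np.array([1.0, -2.0, 0.5])

trials = 2000
est = np.zeros(d)
for _ in range(trials):
    eps = 0.5 * rng.normal(size=n)               # epsilon ~ N(0, sigma^2 I)
    Y = X @ beta + eps
    est += np.linalg.solve(X.T @ X, X.T @ Y)     # OLS estimate for this draw
est /= trials                                    # average should be close to beta
```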
B Proof of Equation (5)
The \( L_{1} \) loss between \( {\hat{p}}_{\mathrm{n}} (x) \) and p(x) is given as follows:

$$ \Vert {\hat{p}}_{\mathrm{n}} - p \Vert _{1} = \int _{-\infty }^{\infty } \left| {\hat{p}}_{\mathrm{n}}(x) - p(x) \right| dx . $$

Moreover, when \( x_{i} \ne x'_{i} \), for any pair of n-tuples \( (x_{1}, \ldots , x_{\mathrm{n}}) \) and \( (x_{1}, \ldots , x'_{i}, \ldots , x_{\mathrm{n}}) \) that differ only in the i-th element, writing \( {\hat{p}}'_{\mathrm{n}} \) for the estimator built from the second tuple, we can write

$$ \left| \, \Vert {\hat{p}}_{\mathrm{n}} - p \Vert _{1} - \Vert {\hat{p}}'_{\mathrm{n}} - p \Vert _{1} \right| \le \frac{1}{\mathrm{n}h} \int _{-\infty }^{\infty } \left| K\!\left( \frac{x - x_{i}}{h} \right) - K\!\left( \frac{x - x'_{i}}{h} \right) \right| dx . $$

Since \( \int _{-\infty }^{\infty } K(x) \, dx = 1 \), we obtain

$$ \left| \, \Vert {\hat{p}}_{\mathrm{n}} - p \Vert _{1} - \Vert {\hat{p}}'_{\mathrm{n}} - p \Vert _{1} \right| \le \frac{2}{\mathrm{n}} . $$

Using McDiarmid’s inequality with bounded differences \( c_{i} = 2/\mathrm{n} \), for all \( \delta > 0 \),

$$ P\left( \Vert {\hat{p}}_{\mathrm{n}} - p \Vert _{1} - {\mathbb {E}}_{x \sim p}\left[ \Vert {\hat{p}}_{\mathrm{n}} - p \Vert _{1} \right] \ge \delta \right) \le \exp \left( -\frac{\mathrm{n}\delta ^{2}}{2} \right) . $$

Here, setting the right-hand side of the above inequality to \( \eta \) gives \( \delta = \sqrt{(2/\mathrm{n}) \log (1/\eta )} \); hence, with probability \( 1 - \eta \), for all n,

$$ \Vert {\hat{p}}_{\mathrm{n}} - p \Vert _{1} < {\mathbb {E}}_{x \sim p}\left[ \Vert {\hat{p}}_{\mathrm{n}} - p \Vert _{1} \right] + \sqrt{\frac{2}{\mathrm{n}} \log \frac{1}{\eta }} . $$
Thus, Eq. 5 follows from Eq. 11.
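The concentration step can be sanity-checked numerically. The sketch below is a standalone illustration (the Gaussian kernel, bandwidth, and grid-based \( L_{1} \) integral are our assumptions) verifying that deviations of the \( L_{1} \) loss above its mean by \( \delta \) occur no more often than \( \exp (-\mathrm{n}\delta ^{2}/2) \):

```python
import numpy as np

# Standalone numerical sanity check (assumed Gaussian kernel, bandwidth h,
# and a grid approximation of the L1 integral): deviations of the L1 loss
# above its mean by delta should occur with probability <= exp(-n*delta^2/2).
rng = np.random.default_rng(0)
n, h = 200, 0.3
grid = np.linspace(-6.0, 6.0, 601)
dx = grid[1] - grid[0]
p = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)    # true density: standard normal

def l1_loss():
    x = rng.normal(size=n)                       # n samples from p
    z = (grid[:, None] - x[None, :]) / h
    p_hat = (np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)).sum(axis=1) / (n * h)
    return np.abs(p_hat - p).sum() * dx          # grid approximation of the L1 loss

losses = np.array([l1_loss() for _ in range(300)])
delta = 0.1
empirical = float((losses > losses.mean() + delta).mean())
bound = float(np.exp(-n * delta**2 / 2))         # McDiarmid tail bound
```

In practice the empirical exceedance frequency is far below the bound, since the \( L_{1} \) loss concentrates much more tightly than the worst-case bounded differences suggest.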
C Attributes of input variables for datasets
D Implementation note
The software used in this study was implemented in Python 3, PyTorch, and Scikit-learn on a Linux machine. The KDE implementation in Scikit-learn was used. The GAN and real-NVP were constructed by referring to the Web sites listed in the notes (footnote 4). In addition, the main part of the TDA algorithm (lines 8–28 in Algorithm 1) is given as follows (footnote 5):
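Since Algorithm 1 itself is not reproduced in this excerpt, the following is only a minimal sketch of the KDE-based generation step using Scikit-learn's `KernelDensity` with a cross-validated bandwidth (the toy data and hyper-parameter grid are assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Minimal sketch (not the paper's Algorithm 1): fit a KDE to the joint
# (input, target) training data and draw synthetic points from it.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                       # small, assumed toy inputs
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=30)
Z = np.column_stack([X, y])                        # model the joint density

# bandwidth chosen by cross-validated log-likelihood (grid is an assumption)
search = GridSearchCV(KernelDensity(kernel="gaussian"),
                      {"bandwidth": np.logspace(-1, 0.5, 10)}, cv=5)
search.fit(Z)
kde = search.best_estimator_

Z_new = kde.sample(100, random_state=0)            # synthetic (X, y) pairs
X_new, y_new = Z_new[:, :2], Z_new[:, 2]
```

Modeling the joint density of inputs and target, rather than the inputs alone, lets the sampled pairs serve directly as additional regression training data.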
Rights and permissions
About this article
Cite this article
Ohno, H. Training data augmentation using generative models with statistical guarantees for materials informatics. Soft Comput 26, 1181–1196 (2022). https://doi.org/10.1007/s00500-021-06533-3