Abstract
Facilitating the application of machine learning (ML) to materials science problems requires enhancing the data ecosystem to enable discovery and collection of data from many sources, automated dissemination of new data across the ecosystem, and connection of data with materials-specific ML models. Here, we present two projects, the Materials Data Facility (MDF) and the Data and Learning Hub for Science (DLHub), that address these needs. We use examples to show how MDF and DLHub capabilities can be leveraged to link data with ML models and how users can access those capabilities through web and programmatic interfaces.
References
A. White: The materials genome initiative: one year on. MRS Bull. 37, 715–716 (2012).
B. Blaiszik, K. Chard, J. Pruyne, R. Ananthakrishnan, S. Tuecke, and I. Foster: The materials data facility: data services to advance materials science research. JOM 68, 2045–2052 (2016).
R. Chard, Z. Li, K. Chard, L. Ward, Y. Babuji, A. Woodard, S. Tuecke, B. Blaiszik, M.J. Franklin, and I. Foster: DLHub: Model and Data Serving for Science, 2018. http://arxiv.org/abs/1811.11213 (accessed March 8, 2019).
P. Nguyen, S. Konstanty, T. Nicholson, T. O’Brien, A. Schwartz-Duval, T. Spila, K. Nahrstedt, R.H. Campbell, I. Gupta, M. Chan, K. McHenry, and N. Paquin: 4CeeD: real-time data acquisition and analysis framework for material-related cyber-physical environments. In 2017 17th IEEE/ACM Int. Symp. Clust. Cloud Grid Comput., IEEE, 2017; pp. 11–20. doi:10.1109/CCGRID.2017.51.
J. O’Mara, B. Meredig, and K. Michel: Materials data infrastructure: a case study of the citrination platform to examine data import, storage, and access. JOM 68, 2031–2034 (2016).
A. Dima, S. Bhaskarla, C. Becker, M. Brady, C. Campbell, P. Dessauw, R. Hanisch, U. Kattner, K. Kroenlein, M. Newrock, A. Peskin, R. Plante, S.-Y. Li, P.-F. Rigodiat, G.S. Amaral, Z. Trautt, X. Schmitt, J. Warren, and S. Youssef: Informatics infrastructure for the materials genome initiative. JOM 68, 2053–2064 (2016).
S. Kirklin, J.E. Saal, B. Meredig, A. Thompson, J.W. Doak, M. Aykol, S. Rühl, and C. Wolverton: The open quantum materials database (OQMD): assessing the accuracy of DFT formation energies. npj Comput. Mater. 1, 15010 (2015).
A. Jain, S.P. Ong, G. Hautier, W. Chen, W.D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K.A. Persson: Commentary: the materials project: a materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
C. Draxl and M. Scheffler: NOMAD: the FAIR concept for big data-driven materials science. MRS Bull. 43, 676–682 (2018).
J. Carrete, W. Li, N. Mingo, S. Wang, and S. Curtarolo: Finding unprecedentedly low-thermal-conductivity half-Heusler semiconductors via high-throughput materials modeling. Phys. Rev. X 4, 011019 (2014).
S. Curtarolo, W. Setyawan, S. Wang, J. Xue, K. Yang, R.H. Taylor, L.J. Nelson, G.L.W. Hart, S. Sanvito, M. Buongiorno-Nardelli, N. Mingo, and O. Levy: AFLOWLIB.ORG: a distributed materials properties repository from high-throughput ab initio calculations. Comput. Mater. Sci. 58, 227–235 (2012).
A. Mannodi-Kanakkithodi, A. Chandrasekaran, C. Kim, T.D. Huan, G. Pilania, V. Botu, and R. Ramprasad: Scoping the polymer genome: a roadmap for rational polymer dielectrics design and beyond. Mater. Today (2017). doi:10.1016/j.mattod.2017.11.021.
R.B. Tchoua, K. Chard, D.J. Audus, L.T. Ward, J. Lequieu, J.J. De Pablo, and I.T. Foster: Towards a hybrid human-computer scientific information extraction pipeline. In 2017 IEEE 13th Int. Conf. e-Science, IEEE, 2017; pp. 109–118. doi:10.1109/eScience.2017.23.
B. Puchala, G. Tarcea, E.A. Marquis, M. Hedstrom, H.V. Jagadish, and J.E. Allison: The materials commons: a collaboration platform and information repository for the global materials community. JOM 68, 2035–2044 (2016).
Materials Simulation Toolkit for Machine Learning (MAST-ML), (n.d.): https://github.com/uw-cmg/MAST-ML (accessed June 27, 2019).
D. Wheeler, D. Brough, T. Fast, S. Kalidindi, and A. Reid: PyMKS: materials knowledge system in python (2014).
L. Ward, A. Dunn, A. Faghaninia, N.E.R. Zimmermann, S. Bajaj, Q. Wang, J. Montoya, J. Chen, K. Bystrom, M. Dylla, K. Chard, M. Asta, K.A. Persson, G.J. Snyder, I. Foster, and A. Jain: Matminer: an open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
S.P. Ong, W.D. Richards, A. Jain, G. Hautier, M. Kocher, S. Cholia, D. Gunter, V.L. Chevrier, K.A. Persson, and G. Ceder: Python materials genomics (pymatgen): a robust, open-source python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).
J. Schneider and J. Hamaekers: The atomic simulation environment - a Python library for working with atoms. J. Phys. Condens. Matter (2017). doi:10.1088/1361-648X/aa680e.
Materials Data Facility Schema Repository, (n.d.): https://github.com/materials-data-facility/data-schemas (accessed June 27, 2019).
I. Foster, K. Chard, and S. Tuecke: The discovery cloud: accelerating and democratizing research on a global scale. In 2016 IEEE Int. Conf. Cloud Eng., IEEE, 2016; pp. 68–77. doi:10.1109/IC2E.2016.46.
R. Ananthakrishnan, B. Blaiszik, K. Chard, R. Chard, B. McCollam, J. Pruyne, S. Rosen, S. Tuecke, and I. Foster: Globus platform services for data publication. In Proc. Pract. Exp. Adv. Res. Comput. - PEARC ’18; ACM Press, New York, NY, USA, 2018; pp. 1–7. doi:10.1145/3219104.3219127.
Z. Avsec, R. Kreuzhuber, J. Israeli, N. Xu, J. Cheng, A. Shrikumar, A. Banerjee, D.S. Kim, L. Urban, A. Kundaje, O. Stegle, and J. Gagneur: Kipoi: accelerating the community exchange and reuse of predictive models for genomics. BioRxiv, 375345 (2018). doi:10.1101/375345.
DataCite Schema, (n.d.): https://schema.datacite.org/ (accessed March 8, 2019).
Y. Babuji, A. Brizius, K. Chard, I. Foster, D.S. Katz, M. Wilde, and J. Wozniak: Introducing parsl: a python parallel scripting library (2017). doi:10.5281/ZENODO.891533.
H.S. Stein, D. Guevarra, P.F. Newhouse, E. Soedarmadji, and J.M. Gregoire: Machine learning of optical properties of materials–predicting spectra from images and images from spectra. Chem. Sci. 10, 47–55 (2019).
S. Mitrovic, E. Soedarmadji, P.F. Newhouse, S.K. Suram, J.A. Haber, J. Jin, and J.M. Gregoire: Colorimetric screening for high-throughput discovery of light absorbers. ACS Comb. Sci. 17, 176–181 (2015).
M. Schwarting, S. Siol, K. Talley, A. Zakutayev, and C. Phillips: Automated algorithms for band gap analysis from optical absorption spectra. Mater. Discov. 10, 43–52 (2017).
L. van der Maaten and G. Hinton: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
M.J. Cherukara, Y.S.G. Nashed, and R.J. Harder: Real-time coherent diffraction inversion using deep generative networks. Sci. Rep. 8, 16520 (2018).
L.A. Curtiss, P.C. Redfern, and K. Raghavachari: Gaussian-4 theory using reduced order perturbation theory. J. Chem. Phys. 127, 124105 (2007).
L. Ward, B. Blaiszik, I. Foster, R.S. Assary, B. Narayanan, and L. Curtiss: Machine learning prediction of accurate atomization energies of organic molecules from low-fidelity quantum chemical calculations. MRS Commun. 9, 891–899 (2019). doi:10.1557/mrc.2019.107.
K.T. Schütt, H.E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller: SchNet–a deep learning architecture for molecules and materials. J. Chem. Phys. 148, 241722 (2018).
R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld: Big data meets quantum chemistry approximations: the Δ-machine learning approach. J. Chem. Theory Comput. 11, 2087–2096 (2015).
Acknowledgements
MDF: This work was performed under financial assistance awards 70NANB14H012 and 70NANB19H005 from the U.S. Department of Commerce, National Institute of Standards and Technology, as part of the Center for Hierarchical Materials Design (CHiMaD). This work was also supported by the National Science Foundation as part of the Midwest Big Data Hub under NSF Award Number 1636950, “BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design (IMaD): Leverage, Innovate, and Disseminate.” DLHub: This work was supported in part by Laboratory Directed Research and Development funding from Argonne National Laboratory, provided under U.S. Department of Energy Contract DE-AC02-06CH11357. We also thank the Argonne Leadership Computing Facility for access to the PetrelKube Kubernetes cluster and Amazon Web Services for providing research credits to enable rapid service prototyping. This research used resources of the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. The authors also thank the researchers who made their datasets, models, and codes openly available.[26,30,32]
Cite this article
Blaiszik, B., Ward, L., Schwarting, M. et al. A data ecosystem to support machine learning in materials science. MRS Communications 9, 1125–1133 (2019). https://doi.org/10.1557/mrc.2019.118
DOI: https://doi.org/10.1557/mrc.2019.118