Data-centric science for materials innovation

Tanaka, Isao; Rajan, Krishna; Wolverton, Christopher

doi:10.1557/mrs.2018.205


Published: 10 September 2018

MRS Bulletin, Volume 43, pages 659–663 (2018)

Isao Tanaka¹,
Krishna Rajan² &
Christopher Wolverton³


Abstract

With the development of high-speed computers, networks, and large-scale storage, researchers can utilize a large volume and wide variety of materials data generated by experimental facilities and computations. The emergence of these big data and advanced analytical techniques has opened unprecedented opportunities for materials research. The discovery of many kinds of materials, such as energy-harvesting materials, structural materials, catalysts, optoelectronic materials, and magnetic materials, has been greatly accelerated through high-throughput screening. The utility of data-centric science for materials research is likely to grow significantly in the future. Unraveling the complexities inherent in big data could lead to novel design rules as well as new materials and functionalities.


Data-intensive scientific discovery

The challenges of dealing with the rapid growth of data in materials science-related fields have long been recognized.1–3 With more recent advances in computer science, the tools for advancing data-intensive scientific discovery have opened the door for more engagement from the scientific community. As suggested by Gray, this has created “The Fourth Paradigm: Data-Intensive Scientific Discovery.”4 He pointed out that experimental, theoretical, and computational science were all being affected by the data deluge, and a fourth “data-intensive” science paradigm was emerging. Indeed, we are witnessing materials science being greatly affected in the new era of “data-centric” materials science, which will likely become the new paradigm for materials research and education.

For more than a decade, MRS Bulletin has published issues related to the nexus of data science and materials science, including materials informatics5 and microstructural informatics.6 In this issue, we continue to expand on those themes by focusing on the numerous efforts in developing and utilizing databases of electronic structure calculations, and their impact on addressing different classes of problems in materials science.

Computational high-throughput screening

First-principles calculations with predictive performance play an essential role in data-centric materials science. In the 1990s, researchers were able to make first-principles calculations for at most 10–100 inorganic crystalline compounds, each with only a few atoms in a unit cell, at a level of accuracy comparable to experiments. Density functional theory (DFT) is a reasonable way to reach this accuracy level without prohibitive computational costs. Today, developments in computational hardware and software have enabled computations of 10⁵–10⁶ compounds having much larger unit cells. These results have been stored in databases such as the Materials Project (MP) (materialsproject.org), AFLOW (aflowlib.org), OQMD (oqmd.org), NOMAD (www.nomad-coe.eu), and Materials Cloud (www.materialscloud.org).

In order to construct such databases, powerful software tools to automate computational engines to run thousands of simulations are essential, as are application programming interfaces (APIs) for the resulting databases. Complex sequences of calculations are encoded into scientific workflows. Robust tools to store, search, and disseminate big data are important as well, and scientists benefit greatly from them. Such software platforms are described in this issue.7–11
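As a rough illustration of what "encoding a sequence of calculations into a workflow" can mean, the sketch below represents calculation steps as a dependency graph and runs them in an order that respects the dependencies. The step names and the `run_workflow` helper are hypothetical and are not taken from any of the platforms described in this issue; only the standard-library `graphlib` module is used.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each step maps to the set of steps it depends on.
workflow = {
    "relax_structure": set(),
    "static_energy": {"relax_structure"},
    "band_structure": {"static_energy"},
    "phonons": {"relax_structure"},
    "store_results": {"band_structure", "phonons"},
}


def run_workflow(graph, run_step):
    """Run every step after its prerequisites and return the execution order."""
    order = list(TopologicalSorter(graph).static_order())
    for step in order:
        run_step(step)  # a real engine would submit and monitor a calculation here
    return order


# Usage: record the order in which the (placeholder) steps are executed.
executed = []
order = run_workflow(workflow, executed.append)
```

A production workflow engine adds much more (job submission, error recovery, provenance tracking), but the dependency graph is the part that makes a multistep calculation reproducible and reusable.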

When a target property can be accurately computed by DFT without excessive computational cost, high-throughput screening (HTS) within the DFT database is a straightforward strategy. These types of screening approaches have been used to design and discover materials with a wide range of properties, including those for structural, electronic, functional, and energy materials. Unfortunately, many materials properties are not directly computable by DFT. For some materials properties that can be computed, the computational expense precludes an HTS approach. In these instances, descriptors or features that correlate with the target property may instead be used for HTS. Examples of useful descriptors have been found through physical considerations and the knowledge of experts.12–15
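In its simplest form, screening within a database amounts to filtering stored entries by computed properties or descriptors. The following minimal sketch assumes a hypothetical record type with two fields (an energy above the convex hull and a band gap); the formulas, values, and thresholds are made up for illustration and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    formula: str          # placeholder label, not a real compound
    e_above_hull: float   # eV/atom, stability measure (hypothetical value)
    band_gap: float       # eV, target property or descriptor (hypothetical value)


def screen(entries, max_e_above_hull=0.05, gap_window=(1.0, 2.0)):
    """Keep entries that are near-stable and whose gap lies inside the window."""
    lo, hi = gap_window
    return [
        e for e in entries
        if e.e_above_hull <= max_e_above_hull and lo <= e.band_gap <= hi
    ]


# Usage with made-up entries.
entries = [
    Entry("A2B", 0.00, 1.4),
    Entry("AB", 0.12, 1.6),   # rejected: too far above the hull
    Entry("AB3", 0.02, 0.3),  # rejected: gap outside the window
    Entry("A3B", 0.03, 1.9),
]
hits = screen(entries)
```

When the target property itself is too expensive to compute, the same filter can be applied to a cheaper descriptor that correlates with it, at the cost of some false positives and negatives.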

Initially, DFT databases were developed for crystal structures registered in experimental databases, such as the Inorganic Crystal Structure Database (ICSD).16 A DFT calculation can find an equilibrium structure corresponding to a local minimum of the potential energy surface, a procedure called geometry optimization. However, the optimization is typically limited to a local structure space in which the number of atoms in the given unit cell is fixed. In addition, this local optimization is most often made using the symmetry of the starting configuration, and does not allow switching to other symmetries. In other words, the thermodynamic stability of the compound is far from guaranteed if the structure of a hypothetical compound is simply optimized locally.
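The limitation of local optimization can be seen in a toy example. The sketch below uses a one-dimensional double-well function as a stand-in for a potential energy surface (an assumption for illustration only; a real surface has many dimensions) and relaxes it by plain gradient descent from two starting points.

```python
def energy(x):
    """Toy double-well 'potential energy surface' with two unequal minima."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x):
    """Analytic derivative of energy(x)."""
    return 4.0 * x * (x * x - 1.0) + 0.3


def relax(x0, step=0.01, tol=1e-10, max_iter=100_000):
    """Follow the downhill gradient from x0 until the force vanishes."""
    x = x0
    for _ in range(max_iter):
        g = gradient(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x


# Two starting guesses end in two different minima.
x_right = relax(0.8)    # stays in the shallower well near x = 0.96
x_left = relax(-0.8)    # reaches the deeper well near x = -1.04
```

Both runs converge, but only the second finds the lower-energy minimum. The relaxed result depends on the starting configuration, which is why a locally optimized hypothetical structure says little about global stability.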

Methods to perform global structure optimization based on strategies such as the evolutionary algorithm,17 particle swarm optimization,18 minima hopping,19 and random structure searching20 have been developed and successfully applied to many examples using program packages such as USPEX (http://uspex-team.org/en/), CALYPSO (www.calypso.cn), and AIRSS (www.mtg.msm.cam.ac.uk/Codes/AIRSS). However, these are computationally demanding for exploration of the vast chemistry space composed of possible combinations of chemical elements. For example, the number of combinations exceeds one billion for quaternary systems with simple composition ratios. Additionally, even with these global optimization tools, which search for minimum energy structures at a given composition, thermodynamic stability is still often not guaranteed, since a stable compound must be lower in free energy than phase separation into all possible decomposition products. The utility of large-scale databases becomes apparent when assessing thermodynamic stability, since these databases allow comparison of the energy of any compound under consideration with all possible combinations of phases included in the database. The major databases previously listed (MP, AFLOW, OQMD) all have automated the construction of these “convex hulls” to assess thermodynamic stability. Figure 1 illustrates the HTS scheme using a DFT database, along with the convex hull concept. If one wants to expand the search space beyond the set of known compounds to “as-yet-unknown” compounds, the thermodynamic stability should be examined before or after the screening. However, computational demand increases dramatically when one goes beyond the boundaries of known compounds. Hence, it remains a challenge to explore the vast chemistry space exhaustively by only using DFT calculations.
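The convex hull construction can be sketched for a binary A–B system as follows. This is a minimal illustration, not the implementation used by any of the databases: it assumes zero-temperature formation energies per atom (so energy stands in for free energy), and the phase names and energies are invented. A phase lying on the lower hull is stable against decomposition into the other listed phases; the vertical distance to the hull is the energy above hull.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(points):
    """Lower convex hull of (x_B, formation energy) points, sorted by x_B."""
    best = {}
    for x, e in points:  # keep only the lowest-energy phase at each composition
        best[x] = min(e, best.get(x, e))
    hull = []
    for p in sorted(best.items()):
        # Drop the previous vertex while it lies on or above the new hull edge.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def energy_above_hull(x, e, hull):
    """Distance of a phase above the hull at composition x (0 if on the hull)."""
    for (x0, e0), (x1, e1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            hull_e = e0 + (e1 - e0) * (x - x0) / (x1 - x0)
            return e - hull_e
    raise ValueError("composition outside the range spanned by the hull")


# Hypothetical phases: (fraction of B, formation energy in eV/atom).
phases = {
    "A": (0.0, 0.0), "A3B": (0.25, -0.10), "AB": (0.50, -0.40),
    "AB3": (0.75, -0.30), "B": (1.0, 0.0),
}
hull = lower_hull(phases.values())
above = {name: energy_above_hull(x, e, hull) for name, (x, e) in phases.items()}
```

In this made-up example A3B has a negative formation energy yet sits about 0.10 eV/atom above the hull, because a mixture of A and AB is lower in energy at the same composition. A negative formation energy alone is therefore not evidence of stability.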

One significant deficiency of current databases is that they contain (for the most part) only experimentally synthesized compounds for which the crystal structure has been determined. Examination of diffraction databases such as the Powder Diffraction File21 shows that there are on the order of 10⁴–10⁵ experimentally synthesized inorganic compounds whose crystal structures have not been solved. Solution of these structures followed by subsequent DFT calculations would enable a large increase in the size of available databases. Methods such as the first-principles assisted structure solution (FPASS) have been developed and applied to this problem of automating structure solution.22,23 However, the computational expense of the methods still leaves a large number of unsolved compounds today. In addition, all DFT databases (MP, AFLOW, OQMD) largely or completely ignore compounds that contain partial occupancy in the crystal structure. Although methods such as cluster expansions or special quasi-random structures24,25 could address these partially occupied structures, the automated use of these tools in HTS still presents challenges. Since a large fraction of the total number of compounds experimentally reported have partial occupancy, a solution of these challenges would also represent a large expansion of the data set.

Machine-learning models for formation energy and other physical quantities

If one can obtain a good “guess” of formation energy by machine learning (ML) using a large set of DFT calculations as training data, the thermodynamic stability of an arbitrary compound can be assessed without computationally demanding DFT calculations. Attempts at such ML models have been carried out for 134,000 small organic molecules in the GDB-9 database.26–29 The accuracy of these ML models is comparable to target values not only for the energy, but also for geometry, harmonic frequency, dipole moment, and polarizability. For inorganic crystals, ML models with reasonable accuracy have been reported as well.30–34 In some cases, errors in the formation energy from these ML models (relative to DFT) were estimated to be close to the errors of DFT relative to experiments.35 These ML models are thus becoming useful for rapid screening to select candidates for detailed examination.32,33,36
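The idea of a surrogate model can be sketched with a small ridge regression written in pure Python. This is a deliberately simple stand-in, not one of the models in the cited work: the descriptors are hypothetical, and the "training energies" are synthetic values generated from a made-up linear rule so that the fit can be checked. In practice the labels would be DFT formation energies and the model would usually be nonlinear.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(X, y, lam=1e-8):
    """Weights minimizing |Xw - y|^2 + lam |w|^2 via the normal equations."""
    d = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(d)]
    return solve(xtx, xty)


def predict(w, x):
    """Predicted energy for one descriptor vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical descriptors: [bias, descriptor_1, descriptor_2] per compound.
X = [[1, 0.2, 0.1], [1, 0.8, 0.3], [1, 1.5, 0.2], [1, 1.1, 0.6], [1, 0.5, 0.9]]
true_w = [-0.1, -0.5, 0.3]                # made-up rule used to create labels
y = [predict(true_w, x) for x in X]       # synthetic "training energies"

w = fit_ridge(X, y)
guess = predict(w, [1, 1.0, 0.5])         # fast estimate for an unseen candidate
```

Once trained, the model evaluates a new candidate in microseconds, which is what makes it useful as a pre-filter before detailed DFT calculations.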

Scientific intuition suggests that the energetics and properties of compounds are determined not only by their chemical compositions, but also by their structures. Consequently, ML models with high accuracy typically use structural descriptors as well as elemental descriptors. The need for structural descriptors limits the use of ML models for exploration within an unknown compound domain, since the structural descriptors cannot be provided a priori for unknown compounds. Even when the compound of interest (e.g., at the extremum of a target property) with respect to structural and elemental descriptors is predicted by an ML model, there is currently no robust approach to reconstruct the crystal structure from these descriptors.

Instead of making ML models for energy or other quantities by a regression approach, one can use a classification approach to judge whether a compound is relevant for further investigation. Attempts to find chemically relevant compositions (CRCs), where the presence of a stable compound is anticipated, have been made using ML models.30,37–39 In a similar manner, a CRC with a high metallic glass-forming ability within experimentally unexplored composition domains was recently successfully predicted and experimentally validated.40,41 These and other efforts in the ML domain have demonstrated the power of application of these nascent tools for materials problems.

Experimental big data analysis and databases

Progress in digitally controlled microanalysis tools has enabled acquisition of big data from nanostructures with atomic resolution. There are many examples of this, such as the analysis of hyperspectral image data obtained by transmission electron microscopy,42,43 and topological data analysis of atom probe tomography images.44 A high-throughput synthesis (thin films) and characterization approach with composition and temperature gradients across the substrate has been systematically conducted and the outputs are stored in the high-throughput experimental materials (HTEMs) database (www.htem.nrel.gov).45 Linking such data to theory and the related assessment of accuracy of measurements in HTS can help in making combinatorial libraries become a source of generating reference data.46 Materials systems for such experiments can be selected using ML models based on DFT databases and other preceding databases, as described earlier in this article. Such combinations are expected to accelerate data-driven discovery.

Further, we had noted46 that “when combinatorial experiments are coupled to the plethora of HTS techniques, they can then serve as experimental platforms for linking length scales and time scales and, hence, multiscale modeling.” To accomplish this, “combinatorial library synthesis needs to be linked to the significant advances in computational modeling,” and this is one area of research that opens new trajectories for harnessing electronic structure databases. With advances in experimental capabilities coupled to the availability and access to large amounts of robust electronic structure data, the foundations for an integrated workflow between experiment and theory can be laid (Figure 2).47

Finally, it may be useful for readers to refer to some other materials databases. NIMS (National Institute for Materials Science) in Japan provides a wide range of materials databases, MatNavi,48 with basic properties of polymers, inorganic materials, and metals, together with experimental materials datasheets such as creep data. The Materials Data Facility (MDF)49 (www.materialsdatafacility.org), a pilot project funded by NIST, provides a scalable repository where materials scientists can publish, preserve, and share research data. Citrination50 is an open database of materials data collected by Citrine Informatics. The Nanoporous Materials Genome Center has produced a database of nanoporous materials, including existing and proposed zeolites, metal-organic frameworks (MOFs), and porous polymer networks (PPNs). Textural properties (surface area and void fraction) have been calculated for all materials, and adsorption properties for gases such as hydrogen, methane, and CO₂ have been simulated for large numbers of these materials.51

In this issue

Currently, several DFT databases of structure, formation energy, and other materials properties for 10⁵–10⁶ inorganic compounds are available. Combined with ML tools, these databases have been utilized for the discovery and design of new materials and for solving different classes of problems in materials science.

The article by Pizzi et al. in this issue7 presents automation software for preparing and performing multistep computational workflows. The AiiDA program manages the execution of dynamic workflows ensuring a format reusable in different projects and by different researchers. The way to integrate some tools for the automated computation is explained as well.

The Ye et al. article8 describes the MP database, which contains DFT results for most of the known inorganic materials. Features or descriptors useful for applying the ML techniques to the database are explained. Examples of the data-accelerated materials design are then showcased.

The AFLOW database contains DFT results for more than 1.8 million materials, including hypothetical compounds. The Oses et al. article9 in this issue illustrates how they combined the database with ML tools in order to make thermodynamic formability modeling feasible. Construction of electronic structure fingerprints is explained as well.

The Ward et al. article11 presents another DFT database, OQMD. Applications of informatics techniques for accelerated materials discovery and extraction of design rules are described. A data-centric approach in experimental materials science is given as well. Future perspectives for the continued expansion of materials informatics applications are then discussed.

The activity of a European Centre of Excellence, NOMAD, is explained by Draxl and Scheffler10 in this issue. NOMAD collects computed data obtained by the most important first-principles codes. It can also manage the data of other databases, such as MP, AFLOW, and OQMD, to feed into the ML process. The outlook for handling experimental data is then discussed.

The Seko et al. article42 describes the data-centric approaches used for characterization and design of nanostructures of materials, which is called nanoinformatics. The combination of ML techniques with DFT data and digitally controlled microscopy and spectroscopy data is shown to be powerful for exploration of the design spaces.

Looking forward

In this issue of the MRS Bulletin, we have focused on electronic structure databases or properties that can be derived from electronic structure calculations. There are many ongoing efforts that are compiling data on different genres of materials, their chemistry, properties, and characterization. As these databases increase in size and diversity, one needs to begin to consider the development of other functionalities of databases. These include the ability to find ways to merge the knowledge derived from different types of databases that capture multiscale information. The integration of information will help to create a new paradigm for the next generation of databases—transforming them from repositories of data to “laboratories” where information and data are fused to help unravel the complexity of materials engineering problems.52–55

References

1. P. Murray-Rust, J. Mol. Graph. 11, 268 (1993).
2. D. Feller, J. Comput. Chem. 17, 1571 (1996).
3. A. Dalby, J.G. Nourse, W.D. Hounshell, A.K.I. Gushurst, D.L. Grier, B.A. Leland, J. Laufer, J. Chem. Inf. Comput. Sci. 32, 244 (1992).
4. T. Hey, S. Tansley, K. Tolle, Eds., The Fourth Paradigm: Data-Intensive Scientific Discovery (Microsoft Research, Redmond, WA, 2009).
5. “Materials Informatics,” MRS Bull. 31 (12), (2006).
6. “Microstructure Informatics in Process–Structure–Property Relations,” MRS Bull. 41 (8), (2016).
7. G. Pizzi, A. Togo, B. Kozinsky, MRS Bull. 43 (9), 696 (2018).
8. W. Ye, C. Chen, S. Dwaraknath, A. Jain, S.-P. Ong, K.A. Persson, MRS Bull. 43 (9), 664 (2018).
9. C. Oses, C. Toher, S. Curtarolo, MRS Bull. 43 (9), 670 (2018).
10. C. Draxl, M. Scheffler, MRS Bull. 43 (9), 676 (2018).
11. L. Ward, M. Aykol, B. Blaiszik, I. Foster, B. Meredig, J. Saal, S. Suram, MRS Bull. 43 (9), 683 (2018).
12. S. Curtarolo, G.L.W. Hart, M.B. Nardelli, N. Mingo, S. Sanvito, O. Levy, Nat. Mater. 12, 191 (2013).
13. B. Meredig, C. Wolverton, Chem. Mater. 26, 1985 (2014).
14. R. Ramprasad, R. Batra, G. Pilania, A. Mannodi-Kanakkithodi, C. Kim, NPJ Comput. Mater. 3, 54 (2017).
15. O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo, A. Tropsha, Nat. Commun. 8, 15679 (2017).
16. FIZ Karlsruhe – Leibniz Institute for Information Infrastructure, “Inorganic Crystal Structure Database,” http://www2.fiz-karlsruhe.de/icsd_home.html.
17. A.R. Oganov, C.W. Glass, J. Chem. Phys. 124, 244704 (2006).
18. Y. Wang, J. Lv, L. Zhu, Y. Ma, Phys. Rev. B Condens. Matter 82, 094116 (2010).
19. M. Amsler, S. Goedecker, J. Chem. Phys. 133, 224104 (2010).
20. C.J. Pickard, R.J. Needs, J. Phys. Condens. Matter 23, 053201 (2011).
21. International Centre for Diffraction Data, http://www.icdd.com.
22. B. Meredig, C. Wolverton, Nat. Mater. 12, 123 (2013).
23. L. Ward, K. Michel, C. Wolverton, Phys. Rev. Mater. 1, 063802 (2017).
24. A. Zunger, S.H. Wei, L.G. Ferreira, J.E. Bernard, Phys. Rev. Lett. 65, 353 (1990).
25. A. Seko, I. Tanaka, Phys. Rev. B Condens. Matter 91, 024106 (2015).
26. R. Ramakrishnan, P.O. Dral, M. Rupp, O.A. von Lilienfeld, Sci. Data 1, 140022 (2014).
27. F.A. Faber, L. Hutchison, B. Huang, J. Gilmer, S.S. Schoenholz, G.E. Dahl, O. Vinyals, S. Kearnes, P.F. Riley, O.A. von Lilienfeld, J. Chem. Theory Comput. 13, 5255 (2017).
28. S. Chmiela, A. Tkatchenko, H.E. Sauceda, I. Poltavsky, K.T. Schütt, K.-R. Müller, Sci. Adv. 3, e1603015 (2017).
29. K.T. Schütt, F. Arbabzadah, S. Chmiela, K.R. Müller, A. Tkatchenko, Nat. Commun. 8, 13890 (2017).
30. B. Meredig, A. Agrawal, S. Kirklin, J.E. Saal, J.W. Doak, A. Thompson, K. Zhang, A. Choudhary, C. Wolverton, Phys. Rev. B Condens. Matter 89, 094104 (2014).
31. A.M. Deml, R. O’Hayre, C. Wolverton, V. Stevanovic, Phys. Rev. B Condens. Matter 93, 085142 (2016).
32. F.A. Faber, A. Lindmaa, O.A. von Lilienfeld, R. Armiento, Phys. Rev. Lett. 117, 135502 (2016).
33. L. Ward, R. Liu, A. Krishna, V. Hegde, A. Agrawal, A. Choudhary, C. Wolverton, Phys. Rev. B Condens. Matter 96, 024104 (2017).
34. A. Seko, H. Hayashi, K. Nakayama, A. Takahashi, I. Tanaka, Phys. Rev. B Condens. Matter 95, 144110 (2017).
35. S. Kirklin, J. Saal, B. Meredig, A. Thompson, J. Doak, M. Aykol, S. Ruhl, C. Wolverton, NPJ Comput. Mater. 1, 15010 (2015).
36. J. Schmidt, L. Chen, S. Botti, M.A.L. Marques, J. Chem. Phys. 148, 241728 (2018).
37. G. Hautier, C.C. Fischer, A. Jain, T. Mueller, G. Ceder, Chem. Mater. 22, 3762 (2010).
38. A. Seko, H. Hayashi, H. Kashima, I. Tanaka, Phys. Rev. Mater. 2, 013805 (2018).
39. A. Seko, H. Hayashi, I. Tanaka, J. Chem. Phys. 148, 241719 (2018).
40. L. Ward, A. Agrawal, A. Choudhary, C. Wolverton, NPJ Comput. Mater. 2, 16028 (2016).
41. F. Ren, L. Ward, T. Williams, K.J. Laws, C. Wolverton, J. Hattrick-Simpers, A. Mehta, Sci. Adv. 4, eaaq1566 (2018).
42. A. Seko, K. Toyoura, S. Muto, T. Mizoguchi, S. Broderick, MRS Bull. 43 (9), 690 (2018).
43. M. Shiga, S. Muto, in Nanoinformatics, I. Tanaka, Ed. (Springer, Singapore, 2018), p. 179.
44. T. Zhang, S.R. Broderick, K. Rajan, in Nanoinformatics, I. Tanaka, Ed. (Springer, Singapore, 2018), p. 133.
45. A. Zakutayev, N. Wunder, M. Schwarting, J.D. Perkins, R. White, K. Munch, W. Tumas, C. Phillips, Sci. Data 5, 180053 (2018).
46. K. Rajan, Annu. Rev. Mater. Res. 38, 299 (2008).
47. Lawrence Berkeley National Laboratory, US Department of Energy Office of Science, ALS-U, “ALS-U: Solving Scientific Challenges with Coherent Soft X-Rays: Workshop Report on Early Science Enabled by the Advanced Light Source Upgrade” (2017), available at http://als.lbl.gov/wp-content/uploads/2017/08/ALS-U-Early-Science-Workshop-Report-Full.pdf.
48. National Institute for Materials Science, “NIMS Materials Database,” http://mits.nims.go.jp/index_en.html.
49. Materials Data Facility, https://materialsdatafacility.org.
50. Citrination, https://citrination.com.
51. C.M. Simon, J. Kim, D.A. Gómez-Gualdrón, J.S. Camp, Y.G. Chung, R.L. Martin, R. Mercado, M.W. Deem, D. Gunter, M. Haranczyk, D.S. Sholl, R.Q. Snurr, B. Smit, Energy Environ. Sci. 8, 1190 (2015).
52. K. Rajan, Ed., Informatics for Materials Science and Engineering: Data Driven Discovery for Accelerated Experimentation and Application (Elsevier, Oxford, 2013).
53. C.S. Kong, M. Haverty, H. Simka, S. Shankar, K. Rajan, Model. Simul. Mater. Sci. Eng. 25, 065014 (2017).
54. S. Srinivasan, S.R. Broderick, R. Zhang, A. Mishra, S.B. Sinnott, S.K. Saxena, J.M. LeBeau, K. Rajan, Sci. Rep. 5, 17960 (2015), doi:10.1038/srep17960.
55. S. Broderick, K. Rajan, Sci. Technol. Adv. Mater. 16, 013501 (2015).


Acknowledgments

I.T. acknowledges the Japan Society for the Promotion of Science (JSPS) for a Grant-in-Aid for Scientific Research on Innovative Areas “Nano-Informatics” (18H05195) and a Grant-in-Aid for Scientific Research (A) (18H03843), and the Japan Science and Technology Agency (JST) for support through the Materials Research by Information Integration Initiative (MI²I). K.R. acknowledges support from the National Science Foundation (NSF) DIBBs Project, Award No. ACI-16-40867. C.W. acknowledges support from the Center for Hierarchical Materials Design and from the US Department of Commerce, National Institute of Standards and Technology under Award No. 70NANB14H012.

Author information

Authors and Affiliations

Department of Materials Science and Engineering, Elements Strategy Initiative for Structural Materials of Kyoto University, Japan
Isao Tanaka
Department of Materials Design and Innovation, University at Buffalo, The State University of New York, USA
Krishna Rajan
Department of Materials Science and Engineering, Northwestern University, USA
Christopher Wolverton

Corresponding author

Correspondence to Isao Tanaka.

About this article

Cite this article

Tanaka, I., Rajan, K. & Wolverton, C. Data-centric science for materials innovation. MRS Bulletin 43, 659–663 (2018). https://doi.org/10.1557/mrs.2018.205

Issue Date: September 2018
