Informatics Infrastructure for the Materials Genome Initiative
A materials data infrastructure that enables the sharing and transformation of a wide range of materials data is an essential part of achieving the goals of the Materials Genome Initiative. We describe two high-level requirements of such an infrastructure as well as an emerging open-source implementation consisting of the Materials Data Curation System and the National Institute of Standards and Technology Materials Resource Registry.
Keywords: Materials Data, Integrated Computational Materials Engineering, Informatics Infrastructure, Materials Genome Initiative
New technologies are often limited by currently existing materials because the time to develop and deploy new materials generally exceeds the product design cycle. For example, it takes approximately 2 years to design a new jet engine using available materials, but it may take 10–15 years to design and certify the new materials needed for the engine.1 Integrated computational materials engineering (ICME) approaches have proven successful at decreasing this gap between the materials development cycle and product development cycle,2 but these approaches are not well developed for all classes and applications of materials, and there is a critical need for materials data and modeling tools that further enable these approaches.
To address the need to decrease the time and cost to develop and deploy new materials by 50%, President Obama announced the Materials Genome Initiative (MGI) in 2011.3 The MGI recognizes that advanced materials play a critical role in clean energy, human welfare, and national security. It is a multiagency initiative that focuses on the infrastructure needed to accelerate materials development, particularly in the following areas: (I) Computational Tools, (II) Experimental Tools, (III) Collaborative Networks, and (IV) Digital Data.
By facilitating the integration of data into developing ICME approaches and other computational approaches to materials discovery, design, development, and deployment, a materials data infrastructure that allows the wide range of materials data to be easily shared and transformed is essential to achieving the goals of the MGI.
As a part of this materials data infrastructure, the National Institute of Standards and Technology (NIST) is establishing essential data exchange protocols and the means to ensure the quality of materials data and models needed to foster widespread adoption of MGI approaches. This informatics infrastructure will play an important role, in particular, in the form of repositories that contain materials simulation and experimental data and metadata, models, and code. These repositories and other infrastructure will provide resources for use in the materials development process as researchers strive to create materials with targeted properties. NIST is particularly working to enable and enhance the exchange of materials resources across repositories, subdomains of the materials community, and industries. NIST is also working to assess and improve the quality of materials data, models, and infrastructure.
Users of these developing data resources come from diverse communities. Many informatics efforts are, by immediate necessity, ad hoc and organic as opposed to being top-down. Each community has its own data, metadata, and tools that are often incompatible. NIST believes that there is a need for new methods to enable the rapid definition of data and metadata, as well as a need for tools to enable rapid discovery and integration of these diverse data.
1. Materials researchers require a platform for interoperable exchange of materials data and metadata, which supports an approach of modular community-developed data standards.
2. Materials researchers need a decentralized infrastructure to enable finding and sharing of materials resources.
To meet the first requirement, researchers must have a system of data templates that can be designed to form custom containers for their experimental and simulation data and its associated metadata. These custom data formats will, however, be made from combinations of standardized components including community-developed templates that describe particular experiments or simulations and low-level reusable data types that encode data values and metadata fields in a standard way. As a result, it is anticipated that many of the issues associated with the current diversity of materials data formats will disappear without requiring researchers to force fit their data into monolithic data formats ill-suited to their needs.
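The idea of assembling a custom container from standardized components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Quantity` type, the `TensileTest` template, the element names, and the numeric values are all hypothetical and are not taken from any actual community schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Quantity:
    """Hypothetical low-level reusable type: a numeric value with a unit."""
    name: str
    value: float
    unit: str

    def to_xml(self) -> ET.Element:
        # Every quantity is encoded the same way, whatever template uses it.
        elem = ET.Element(self.name, attrib={"unit": self.unit})
        elem.text = repr(self.value)
        return elem


@dataclass
class TensileTest:
    """Hypothetical community template assembled from Quantity components."""
    material: str
    temperature: Quantity
    yield_strength: Quantity

    def to_xml(self) -> ET.Element:
        root = ET.Element("TensileTest")
        ET.SubElement(root, "material").text = self.material
        root.append(self.temperature.to_xml())
        root.append(self.yield_strength.to_xml())
        return root


# Illustrative values only; they do not describe a real measurement.
record = TensileTest(
    material="example-alloy",
    temperature=Quantity("temperature", 298.0, "K"),
    yield_strength=Quantity("yieldStrength", 250.0, "MPa"),
)
xml_text = ET.tostring(record.to_xml(), encoding="unicode")
print(xml_text)
```

Because both measured fields reuse the same `Quantity` component, any tool that understands that component can read the value and unit from either field without knowing the template it came from.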
Web-based search engines, despite their success, are in many ways poorly suited to searching for scientific resources. In this context, we use the term “resources” to include datasets and data collections or repositories, and information about organizations, application programming interfaces (APIs) and other information services, informational websites, and software. Simple text-based searches often return too many irrelevant results, forcing researchers to filter tediously through pages of output or to spend time devising clever search queries. Meeting the second requirement implies creating an informatics infrastructure that will enable materials researchers to search for materials data using metadata schemas with well-defined meanings. It will also enable them to make their data and other resources available to others using the same decentralized infrastructure.
The use of registries in informatics infrastructures is not new. In healthcare, registries support the task of identifying documents related to a patient in systems conforming to the Integrating the Healthcare Enterprise (IHE) Cross-Enterprise Document Sharing (XDS) integration profile.4 Metadata pertaining to a patient document are indexed in a registry that can be queried. In astronomy, the Virtual Observatory5 provides astronomers with a distributed ecosystem for data-based research that includes community-established data protocols, formats, and tools. A key component of the discovery framework is federation of data resource registries that contain searchable metadata about archives, data collections, and services that are available.6
In addition, various other scientific registries and support tools are being developed.7, 8, 9 The Research Data Alliance (RDA) Data Type Registries Working Group has defined a data model for the collection of scientific data and has implemented a prototype data type registry10 to facilitate the understanding of scientific data collected by different research groups.
Also, a variety of materials science-based efforts exist to improve the exchange of materials-based data. The Materials Intelligence system from Granta Design1 integrates materials data with a variety of software tools.
An important aspect of our architecture is the use of XML to structure data and metadata because XML provides standardized methods for encoding, interpreting, and transforming them. We expect that user communities will work together to generate shared data and metadata models expressed as XML Schema. Our infrastructure then dynamically renders a GUI based on the schema to allow users to input data conforming to that schema. Because MongoDB uses Binary JSON (BSON), a variant of JSON, to represent its data, we have created a translation layer that converts XML documents into the corresponding BSON and then back to XML as needed. The transformability of XML is also used to export retrieved XML documents to other formats. Currently, we allow for conversion to other text-based formats such as comma-separated values (CSV), but in principle any format can be generated, including graphics.
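A round trip between XML and a JSON-like document can be sketched as follows. This is not the actual translation layer, whose mapping is not described here; it is a minimal stand-in that assumes a simple tag/attribute/text/children layout and uses a JSON serialization step in place of a real database write and read. The sample document is invented for illustration.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_doc(elem):
    """Convert an element into a nested dict that a document store could hold.

    Mixed content (text following a child element) is ignored in this sketch.
    """
    return {
        "tag": elem.tag,
        "attrib": dict(elem.attrib),
        "text": (elem.text or "").strip(),
        "children": [xml_to_doc(child) for child in elem],
    }


def doc_to_xml(doc):
    """Rebuild an element tree from the nested dict produced by xml_to_doc."""
    elem = ET.Element(doc["tag"], attrib=doc["attrib"])
    if doc["text"]:
        elem.text = doc["text"]
    for child in doc["children"]:
        elem.append(doc_to_xml(child))
    return elem


# Hypothetical input document.
source = (
    '<sample id="s1">'
    '<composition element="Ni">0.6</composition>'
    '<composition element="Al">0.4</composition>'
    "</sample>"
)

doc = xml_to_doc(ET.fromstring(source))
stored = json.loads(json.dumps(doc))  # stand-in for a database write and read
restored = ET.tostring(doc_to_xml(stored), encoding="unicode")
print(restored)
```

Keeping children in an ordered list rather than keying them by tag name preserves repeated elements and their order, which matters when the document must be regenerated and revalidated against its schema.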
Our architecture has been implemented for Windows, Mac OS X, and Linux and is currently the basis for four systems: the Materials Data Curation System (MDCS), the NIST Materials Resource Registry (NMRR), the MGI Code Catalog (MCC), and the National Metrology Institutes Resource Registry (NMIRR). The first two systems will be discussed in more detail here.
Materials Data Curation System
The MDCS was designed to address the first high-level informatics requirement of the MGI that materials researchers need modular data models that capture their data and metadata in community-developed templates using reusable data types. The MDCS source code and installation instructions are available from https://github.com/usnistgov/MDCS.
One of the great strengths of XML is its ability to be transformed into other formats by using standard tools such as Extensible Stylesheet Language Transformations (XSLT), a programming language that uses XML syntax. The MDCS Exporter allows for the XML documents stored in the MDCS to be transformed into other formats such as CSV by using an XSLT stylesheet associated with the schema. This enables data stored in XML to be converted into tool-specific formats for use as part of scientific workflows.
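The kind of flattening such an export performs can be illustrated in Python. Note the substitution: the MDCS Exporter uses XSLT stylesheets, whereas the Python standard library has no XSLT processor, so this sketch does the equivalent XML-to-CSV conversion with `xml.etree.ElementTree` and `csv` instead. The element names and values are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical document; element names and numbers are illustrative only.
XML_DOC = """
<measurements>
  <point><temperature>300</temperature><heatCapacity>24.2</heatCapacity></point>
  <point><temperature>400</temperature><heatCapacity>25.1</heatCapacity></point>
</measurements>
"""


def xml_to_csv(xml_text, row_tag, columns):
    """Emit one CSV row per `row_tag` element, one column per child tag."""
    root = ET.fromstring(xml_text.strip())
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)  # header row
    for row in root.iter(row_tag):
        # Missing child elements become empty cells.
        writer.writerow([row.findtext(col, default="") for col in columns])
    return buffer.getvalue()


csv_text = xml_to_csv(XML_DOC, "point", ["temperature", "heatCapacity"])
print(csv_text)
```

In a schema-driven system the row element and column list would come from a transformation associated with the schema rather than from hard-coded arguments as they do here.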
NIST Materials Resource Registry
The NIST Materials Resource Registry (NMRR) was developed to address the second high-level MGI informatics requirement that materials researchers need to be able to find and share materials resources in a decentralized way. The source code for the NMRR is available from https://github.com/usnistgov/MaterialsResourceRegistry.
The NMRR and the MDCS are complementary systems where the MDCS can be used to make materials data accessible and the NMRR can be used to make materials data discoverable. From the perspective of the data consumer, a search on the NMRR returns candidate instances of the MDCS and other repositories. The user can then search an individual repository for candidate datasets.
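The two-step discovery flow described above can be sketched with in-memory data. Everything in this example is an assumption made for illustration: the record fields, the controlled terms, the `example.org` URLs, and the functions do not correspond to the actual NMRR or MDCS interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    """Hypothetical registry entry: it describes a resource, not its data."""
    title: str
    resource_type: str
    material_classes: frozenset
    url: str


# Step 1 data: what a registry knows about available resources.
REGISTRY = [
    ResourceRecord("Example curation instance A", "repository",
                   frozenset({"polymers"}), "https://repo-a.example.org"),
    ResourceRecord("Example simulation code", "software",
                   frozenset({"metals"}), "https://code.example.org"),
    ResourceRecord("Example curation instance B", "repository",
                   frozenset({"metals"}), "https://repo-b.example.org"),
]

# Step 2 data: what each repository holds, keyed by its URL.
REPOSITORIES = {
    "https://repo-a.example.org": [
        {"title": "Example glass transition dataset", "property": "glass transition"},
    ],
    "https://repo-b.example.org": [
        {"title": "Example phase equilibria dataset", "property": "phase equilibria"},
        {"title": "Example diffusivity dataset", "property": "diffusivity"},
    ],
}


def search_registry(records, resource_type, material_class):
    """Match on structured metadata fields rather than on free text."""
    return [r for r in records
            if r.resource_type == resource_type
            and material_class in r.material_classes]


def search_repository(url, prop):
    """Look for datasets inside one repository found through the registry."""
    return [d for d in REPOSITORIES.get(url, []) if d["property"] == prop]


candidates = search_registry(REGISTRY, "repository", "metals")
hits = [(rec.title, ds["title"])
        for rec in candidates
        for ds in search_repository(rec.url, "diffusivity")]
print(hits)
```

The registry query excludes the software entry even though it also concerns metals, because the resource type is a distinct metadata field, which is the kind of filtering a plain text search cannot do reliably.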
The goal of the MDCS is to facilitate the collection, use, and reuse of materials data and to provide the needed informatics infrastructure to facilitate the implementation of ICME approaches. Several collaborators are using the MDCS for their own work. Northwestern University’s NanoMine, an online platform for the prediction of polymer nanocomposites, uses the MDCS to curate nanocomposite processing, structure, and property data reported in literature and then to link it to a variety of modeling tools.14 Raymundo Arroyave’s group at Texas A&M University is using the MDCS to collect data from computational materials science simulations and measurements of differential scanning calorimetry. At NIST, work is being done to curate both literature and experimental thermodynamic data with the MDCS. The NIST Thermodynamic Research Center is expanding ThermoML15,16 to include data on metals and plans to integrate their efforts with the MDCS.
The MDCS is also being integrated with the Interatomic Potentials Repository (IPR) Project.17 A recent article summarized the expanded scope of the IPR Project as a response to the MGI.18 Prior to the creation of the MDCS, metadata for interatomic potentials were manually curated in semistructured text files. As the project is working to enable selection of interatomic potentials based on material properties and other metadata, the MDCS is being used to curate all supporting data and metadata. Furthermore, rapid property calculation tools are being developed and directly integrated with the MDCS via its API. This combined toolset could also be used to develop new potentials, where local instances of IPR tools and the MDCS address data management issues associated with developing many different iterations or variants of interatomic potentials, as part of the typical development process.
A 2014 whitepaper indicated that high-throughput experiments (HTEs) are uniquely suited to meet many needs within the MGI by generating large volumes of high-quality experimental data suitable for model validation or model input.19 Efforts at NIST are focused on capturing data as it is generated on the synthesis or measurement apparatus and automatically transforming applicable data and metadata into XML formats, which are compliant with the MDCS. This effort is part of a broader effort to exchange samples and data across institutions to advance HTE metrology.
The open source software infrastructure presented in this work supports both data curation, using modular data schema models for data exchange, and a decentralized data search platform. The MDCS will enable the materials science community to build and share community-based data models for the curation of specific data types. The Materials Resource Registry will improve the ability to find and share data, with its metadata harvestable by other registries. Both the MDCS and the NMRR are designed to work with other data curation and sharing tools to further the aims of the MGI.
Certain commercial equipment or software is identified in this article to foster understanding. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the material or equipment identified is necessarily the best available for the purpose.
The authors would like to thank Raymundo Arroyave, Cate Brinson, Yannick Congo, Lucas Hale, Ya-Shian Li-Baboud, Greta Lindwall, Chris Muzny, Pierre Savonitto, Daniel Sauceda, and Richard Zhao for their support and contributions.
- 3. National Science and Technology Council, Materials Genome Initiative for Global Competitiveness (Washington: Office of Science and Technology Policy, 2011).
- 4. Integrating the Healthcare Enterprise (IHE International, 2015), http://www.ihe.net/. Accessed 24 March 2016.
- 7. Cordra (Corporation for National Research Initiatives, January 2016), https://www.cordra.org/. Accessed 24 March 2016.
- 8. 2nd Generation of Open Access Infrastructure for Research in Europe, OpenAIRE (OpenAIRE Consortium, February 2016), https://www.openaire.eu/. Accessed 22 March 2016.
- 9. Research Data Switchboard (Research Data Alliance, March 2015), http://www.rd-switchboard.org/. Accessed 21 March 2016.
- 10. Data Type Registry (Corporation for National Research Initiatives, August 2014), http://typeregistry.org/registrar/. Accessed 13 July 2015.
- 14. C. Brinson, H.R. Zhao, NanoMine, http://brinson.mech.northwestern.edu/research/Nanomine.html. Accessed Feb 2016.
- 19. M.L. Green, J.R. Hattrick-Simpers, C.L. Choi, I. Takeuchi, A.M. Joshi, S.C. Barron, T. Chiang, A. Davydov, S. Empedocles, J. Gregoire, and A. Mehta, Fulfilling the Promise of the Materials Genome Initiative with High-Throughput Experimentation (MR Society, 2014), http://www.mrs.org/mgi-workshop-full-report/. Accessed 24 March 2016.