Data representation and file format
Material microstructures come in many different sizes and shapes and the features of interest have different dimensionalities. Data describing attributes of microstructure can be obtained from many sources (Scanning Electron Microscopy, Transmission Electron Microscopy, Optical Microscopy, Electron Backscatter Diffraction, Energy Dispersive Spectroscopy, Wavelength Dispersive Spectroscopy, 3D Atom Probe, Atomic Force Microscopy, etc.). Unfortunately, during the development of these experimental methodologies, no common data structure was developed and as such, combining data from multiple sources is difficult. Further, the tendency to link the data with a material class (metal, ceramic, composite, polymer, etc.) has stunted the development of a unified method for describing microstructure data. During development of DREAM.3D, the vision of a unified representation of all digital microstructure data for all material classes and length-scales presented a challenge. As discussed in [6], when writing code or designing data structures that operate on or represent a variety of features and dimensions, it is critical to establish a proper abstraction layer to ensure transferability. In the case of DREAM.3D and microstructure, the authors believe the proper abstraction layer is to work with all features of structure as geometrical objects. By abstracting the materials interpretation of the features and focusing only on how the feature is described digitally, DREAM.3D has been able to institute a general, unified structure for digital data that assumes no prior knowledge of length-scale or material class. The following subsections will discuss this generic data structure and illustrate its direct use in a wide range of materials applications.
Geometric mesh element construct for holding digital microstructure data
Spatially-resolved digital data, of which most material microstructure data is a subset, are simply information or attributes that are associated with discrete geometrical elements. These elements can be pixels in an image, points in a probe scan, line segments in a digital model, etc. At this level, all digital microstructure data can be treated and organized similarly within a computer. Meshes of appropriate dimension can be created and data can sit on the mesh element(s) that they describe. For example, the mass-to-charge ratio of an atom in an atom probe dataset is information associated with a point, while the misorientation across a boundary in an electron backscatter diffraction (EBSD) dataset is information associated with a surface. As such, any given dataset has an associated mesh dimensionality equal to the highest dimension of feature its data describes. It should be noted that the mesh dimensionality may be different from the dimensionality of the dataset. For example, an atom probe dataset consists of 3D locations having (x,y,z) coordinates, but represents microstructure features that are treated as 0D points.
DREAM.3D organizes and stores mesh data (and subsequent feature and ensemble data discussed in the next section) in a structure called a “data container”. DREAM.3D uses four types of data containers for the different possible data dimensionalities (Vertex = 0D, Edge = 1D, Surface = 2D, Volume = 3D). Figure 1 illustrates the different data containers and the data they can hold. As Figure 1 shows, lower dimensional geometrical objects bound higher dimensional objects and a given data container can store data on mesh elements of a lower dimension. For example, in a 3D EBSD dataset, the collected orientation data is generally treated as belonging to a cell, but the misorientation between neighboring cells could also be stored on the face shared by the cells, and the edges and vertices of the cells could store the coordination number of the different features they belong to (e.g., triple line or quadruple point). An example dataset of each type of data container can be found in the supporting material. The examples include a Vienna Ab initio Simulation Package (VASP) input structure (Vertex data container - Additional file 1), a ParaDiS output structure (Edge data container - Additional file 2), a grain boundary mesh of a synthetic polycrystalline microstructure (Surface data container - Additional file 3) and a synthetic polycrystalline microstructure (Volume data container - Additional file 4).
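The relationship between container type and mesh element dimension can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not DREAM.3D code: the names `ContainerType`, `storable_elements` and `container_for` are hypothetical, and the element names (vertex, edge, face, cell) are taken from the EBSD example in the text.

```python
from enum import IntEnum


class ContainerType(IntEnum):
    """The four data container types; the value is the mesh dimensionality."""
    VERTEX = 0
    EDGE = 1
    SURFACE = 2
    VOLUME = 3


# Mesh element name for each dimension (hypothetical naming for this sketch).
ELEMENT_NAMES = {0: "vertex", 1: "edge", 2: "face", 3: "cell"}


def storable_elements(container):
    """List the mesh elements a container can hold data on.

    A container holds data on elements of its own dimension and of every
    lower dimension, because lower-dimensional elements bound
    higher-dimensional ones.
    """
    return [ELEMENT_NAMES[d] for d in range(int(container) + 1)]


def container_for(attribute_dims):
    """Choose the container type from the highest-dimensional element
    that any attribute in the dataset is associated with."""
    return ContainerType(max(attribute_dims))


# 3D EBSD: orientation on cells (3), misorientation on faces (2).
ebsd = container_for([3, 2])
# Atom probe: mass-to-charge ratio on points (0), despite 3D coordinates.
atom_probe = container_for([0])
```

Under these assumptions, `ebsd` is a Volume container that can also carry face, edge and vertex data, while `atom_probe` is a Vertex container.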
The mesh that represents the data locations is unique to the dataset itself. While the mesh can be altered via smoothing, regridding or other processing steps, it is generally defined by the data collection or generation protocol/settings. Furthermore, the mesh itself is not influenced by the material class and can exist at any length-scale. The mesh is solely the physical location of all data elements and their associated attributes.
Hierarchical grouping for feature and ensemble representation
A given material’s microstructure can be thought of as being constructed using building blocks called “features” such as grains, fibers, pores, magnetic domains, corrosion pits, dislocations, individual atoms and many other possibilities. Though these features are very different in the “real world” materials sense, digitally they are all simply groups of discrete mesh elements. It is the user’s prerogative to group the mesh elements in whichever way makes most sense for their uses, which imparts a certain uniqueness to the data set. It is the human interpretation of what the features represent that links the data to a specific material class and/or length-scale. DREAM.3D utilizes a software engineering technique where all of the domain specific groupings can be represented by a generalized data structure. This is commonly referred to as an “Abstraction Layer” in the software engineering field and allows the DREAM.3D system to grow and adapt to new domains.
From the perspective of the computer, the act of assigning elements to a given feature is still material class and length-scale independent. Mesh elements are simply noted to belong to a given feature for a given segmentation/grouping protocol. For each grouping/segmentation protocol, all elements are set to belong to one and only one feature. It is possible that a user would want to group mesh elements by multiple protocols. For example, mesh elements could be grouped by common orientation and then by common chemistry if a data set had both orientation and chemical information. If multiple grouping protocols are used, then each mesh element would have a vector of feature IDs listing which feature it belongs to in each grouping.
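A minimal sketch of this bookkeeping, under simplifying assumptions, is given here. The per-element labels and the helper `assign_feature_ids` are hypothetical, and elements are grouped purely by equal label; a real segmentation would normally also require the grouped elements to be spatially connected.

```python
# Hypothetical per-element measurements for six mesh elements.
orientation_class = ["A", "A", "B", "B", "B", "C"]
chemistry_class = ["Fe", "Ni", "Ni", "Ni", "Fe", "Fe"]


def assign_feature_ids(labels):
    """Return one integer feature ID per mesh element.

    Each distinct label receives the next unused ID (starting at 1) in
    order of first appearance, so every element belongs to exactly one
    feature for this grouping protocol.
    """
    ids = {}
    return [ids.setdefault(label, len(ids) + 1) for label in labels]


by_orientation = assign_feature_ids(orientation_class)
by_chemistry = assign_feature_ids(chemistry_class)

# With two grouping protocols, each mesh element carries a vector of
# feature IDs: (ID under orientation grouping, ID under chemistry grouping).
feature_id_vectors = list(zip(by_orientation, by_chemistry))
```

Nothing in the sketch refers to material class or length-scale; the meaning of each feature ID comes only from the grouping protocol that produced it.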
After features are defined, attributes such as size, shape, etc. can be calculated and stored in association with each feature. The structure of how these attributes are stored will be discussed in the next section. The user may also wish to group features together to establish “ensembles”. Ensembles are groups of features that the user has linked for some reason. For example, a group of features could be linked because they are all the same phase, because they are the largest 10% of features, etc. Just as each mesh element has one (or more) feature IDs indicating which feature it belongs to, each feature has one (or more) “ensemble IDs”. Similar to features and individual elements, attributes describing ensembles such as size distribution, average feature curvature, orientation distribution function (ODF), etc. can be calculated and stored in association with each ensemble.
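The two-level hierarchy can be illustrated with a small Python sketch. All values are made up for illustration, and the variable names (`feature_ids`, `ensemble_of_feature`, `mean_size`) are hypothetical; feature size is simply counted in mesh elements.

```python
from collections import Counter

# Hypothetical feature ID for each of eight mesh elements (one grouping protocol).
feature_ids = [1, 1, 1, 2, 2, 3, 3, 3]

# Hypothetical ensemble ID for each feature (for instance, a phase).
ensemble_of_feature = {1: 1, 2: 1, 3: 2}

# Feature-level attribute: size of each feature, counted in mesh elements.
feature_size = Counter(feature_ids)

# Ensemble-level attribute: mean size of the features in each ensemble.
sizes_by_ensemble = {}
for feature, ensemble in ensemble_of_feature.items():
    sizes_by_ensemble.setdefault(ensemble, []).append(feature_size[feature])

mean_size = {
    ensemble: sum(sizes) / len(sizes)
    for ensemble, sizes in sizes_by_ensemble.items()
}
```

The same pattern applies one level up and one level down: elements point to features, features point to ensembles, and attributes are computed over the members at each level.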
Scalable layout for information storage
At all levels, from the individual mesh elements to features and ensembles, the way information is stored must be dynamic. In order to be a flexible software environment that can work with data from multiple sources and treat microstructures from all material classes, it is not reasonable for DREAM.3D to predefine what attributes can be associated with a mesh element, feature or ensemble. As such, a matrix-style container is needed for holding information of this type. For example, in an EBSD scan, each pixel has an Euler angle set, a phase ID, a coordinate in space and a list of values associated with the indexing approach of the commercial software that collected the scan. These attributes, as a set, are called a “property vector” in DREAM.3D and define the pixel with which they are associated. These property vectors are shown as columns in Figure 2. The rows in Figure 2 are the lists of single attributes for all pixels and are called “attribute arrays”. Given this container structure, it becomes clear that as filters are applied to the data, more attribute arrays are generated and each property vector grows.
At each level (mesh element, feature, ensemble), attribute matrices can exist. Only one matrix exists at the element level because there is no user grouping at that level and as such there is only one definition or instance of the mesh. However, at the feature and ensemble levels, many attribute matrices can coexist. In an attribute matrix, every property vector is the same size and every attribute array is the same size. This is because filters calculate attributes and filters must loop over all members in the attribute matrix for which the attribute is being calculated.
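The container described above can be sketched as a small Python class. This is an illustrative model only, not the DREAM.3D implementation: the class `AttributeMatrix`, its methods, and the array names are all hypothetical.

```python
class AttributeMatrix:
    """Equal-length attribute arrays (rows); a property vector is a column."""

    def __init__(self, num_objects):
        self.num_objects = num_objects
        self.arrays = {}  # attribute name -> list holding one value per object

    def add_array(self, name, values):
        """Store an attribute array, enforcing one value per object."""
        values = list(values)
        if len(values) != self.num_objects:
            raise ValueError(
                f"{name!r} has {len(values)} values, expected {self.num_objects}"
            )
        self.arrays[name] = values

    def property_vector(self, index):
        """Collect every attribute of a single object."""
        return {name: values[index] for name, values in self.arrays.items()}

    def apply_filter(self, name, func):
        """Loop over all objects and store the results as a new attribute array."""
        results = [func(self.property_vector(i)) for i in range(self.num_objects)]
        self.add_array(name, results)


# Hypothetical element-level matrix for a three-pixel EBSD scan.
pixels = AttributeMatrix(num_objects=3)
pixels.add_array("EulerAngles", [(0.0, 0.5, 1.0), (0.1, 0.5, 1.0), (2.0, 1.0, 0.3)])
pixels.add_array("PhaseId", [1, 1, 2])

# A filter adds one attribute array, so every property vector grows by one entry.
pixels.apply_filter("IsPhase1", lambda pv: pv["PhaseId"] == 1)
```

Because `apply_filter` visits every object, the new array always has the same length as the existing ones, which is the size constraint noted above.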
HDF5 file structure
The Hierarchical Data Format Version 5 (HDF5) is an open-source library developed and maintained by The HDF Group [7] that implements a file format designed to be flexible, scalable, highly performant and portable. HDF5 allows each application to organize its data in a hierarchy that makes sense for the application. Virtually any type of data, from scalar values to complex data structures, can be stored in an HDF5 file. Scalability has been a design consideration from the outset and HDF5 can handle data objects of almost any size or dimensionality. The library has also been designed to be efficient at querying, reading and writing data objects, including the use of parallel I/O when needed. One of the most important aspects of HDF5 is its portability across all the major computing operating systems. HDF5 has support for C, C++, Fortran and Java as its native implementations; many higher-level programming languages also have direct support for HDF5, including IDL (Interactive Data Language), MATLAB and Python.
HDF5 files can be thought of as ‘a file system within a file’. Data can be stored as datasets (analogous to files) and arranged inside groups (analogous to folders) all within the HDF5 file. This structure is well-suited for storing the organized data from DREAM.3D. The organization of a typical DREAM.3D file is shown in Figure 3. At the ‘root directory’ or highest level in the file, two groups exist for holding 1) the processing pipeline and 2) all data containers of the dataset. Inside the pipeline group, there are subgroups for each filter and within each subgroup there are datasets for each of the input parameters of the filter. The subgroups are named according to their numerical order in the processing pipeline, and attributes stored on each group list the name of the filter and its version number. Each dataset inside a subgroup is named after the input parameter it holds and contains the value(s) of that parameter. Inside the data container group are subgroups for each data container that exists in the dataset. Each subgroup carries the name the user gave to the data container. Within each subgroup there are multiple groups (the number depending on the dimensionality of the data container). Each group at this level is associated with an attribute matrix described in the previous section. For example, if the data container were a vertex data container, then there would be a group for the vertex mesh element attribute matrix and there could be multiple groups of feature and ensemble attribute matrices depending on the number of grouping schemes employed by the user. In the example in Figure 3, the dataset contains a single volume data container. Within an attribute matrix group, each dataset represents an attribute array (or row from Figure 2). The name of the dataset is the name of the attribute array and the contents are the entire attribute array in order from object 1 to N.
The structured layout of HDF5 and the DREAM.3D file also offers potential for databasing of datasets. The ability of HDF5 to query the existence of datasets and groups without reading the entire file is well-suited for determining whether data meets a specified criterion, whether it be a specific processing path, attribute array, etc.