UHCSDB: UltraHigh Carbon Steel Micrograph DataBase
We present a new microstructure dataset consisting of ultrahigh carbon steel (UHCS) micrographs taken over a range of length scales under systematically varied heat treatments. Using the UHCS dataset as a case study, we develop a set of visualization tools for interacting with and exploring large microstructure and metadata datasets. Based on generic microstructure representations adapted from the field of computer vision, these tools enable image-based microstructure retrieval, as well as spatial maps of both microstructure and related metadata, such as processing conditions or properties measurements. We provide the microstructure image data, processing metadata, and source code for these microstructure exploration tools. The UHCS dataset is intended as a community resource for development and evaluation of microstructure data science techniques and for creation of microstructure data science teaching modules.
Keywords: Microstructure · Processing · Steels · Computer vision
We introduce a microstructure dataset focusing on complex, hierarchical structures found in a single ultrahigh carbon steel (UHCS) alloy under a range of heat treatments performed by Hecht et al. [2, 3]. In a concurrent report, we use this dataset to evaluate several microstructure representations based on contemporary computer vision research, and discuss application of both supervised and unsupervised machine learning methods to yield insight into microstructure–properties relationships.
This document describes the contents of the UHCS dataset in detail and outlines the data visualization tools we developed for exploring microstructure datasets with processing and/or properties metadata. We reflect on our experience using a simple SQL database to manage microstructure and processing metadata instead of choosing one of the emerging materials data standards. We also present a responsive web application with microstructure-query and metadata visualization tools, currently accessible online at http://uhcsdb.materials.cmu.edu.
The UHCS microstructure and metadata dataset can be used by the materials community to define benchmark microstructure data science tasks, such as microstructure classification, microstructure clustering, and developing data-driven microstructure models for processing-structure-properties relationships. The dataset can also support the development and evaluation of new data-driven microstructure representations that address changes in physical scale, magnification, and sample orientation. Finally, the microstructure data visualization tools we present can be reused and extended to enable exploratory analysis of large microstructure datasets.
Materials and Methods
[Tables: nominal as-cast composition of the present UHCS alloy; listing of quench methods (e.g., 650 °C for 1 h); annealing temperatures in °C; annealing times in minutes (M) and hours (H)]
The microscope used to collect these micrographs did not export any imaging metadata in a machine-readable format. Because the same SEM happened to be used for each image, the human-readable metadata accompanying the micron bar at the bottom of each image was laid out consistently (refer to Fig. 1a–f). As a result, we were able to use a semi-automated approach to recover some imaging metadata, including the magnification, imaging mode, and most importantly the physical scale of each image in microns per pixel. In this dataset, the micron bars consistently have the highest aspect ratio of any of the white elements on the black metadata panel, making it trivial to obtain their length in pixels. We extracted the textual metadata using tesseract OCR, an open-source optical character recognition system. Because each text metadata field has a consistent bounding box, we can crop the corresponding image patch from the metadata panel and pass it to tesseract to obtain each metadata field as a string of characters.
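The micron bar measurement step can be sketched as a connected-component analysis. The following is a minimal example, assuming a thresholded grayscale metadata panel and using scipy; the actual pipeline and parameter choices may differ:

```python
import numpy as np
from scipy import ndimage

def micron_bar_length(panel, threshold=128):
    """Estimate the micron bar length (in pixels) on a metadata panel.

    Assumes a grayscale array with white features on a black background,
    and that the micron bar is the white connected component with the
    highest aspect ratio (as in the UHCS metadata panels).
    """
    labels, _ = ndimage.label(panel > threshold)
    best_aspect, best_width = 0.0, 0
    for rows, cols in ndimage.find_objects(labels):
        height = rows.stop - rows.start
        width = cols.stop - cols.start
        aspect = width / max(height, 1)
        if aspect > best_aspect:
            best_aspect, best_width = aspect, width
    return best_width

# Synthetic panel: a thin 3x60 "micron bar" and a 10x12 text-like blob.
panel = np.zeros((40, 100), dtype=np.uint8)
panel[30:33, 20:80] = 255   # micron bar: aspect ratio 20
panel[5:15, 5:17] = 255     # text blob: aspect ratio 1.2

length_px = micron_bar_length(panel)   # 60 pixels
microns_per_pixel = 10.0 / length_px   # if the OCR'd label reads "10 μm"
```

Combined with the OCR'd scale label, the bar length in pixels directly yields the physical scale of each image.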
Due to the wide variety of formats used for micron bars, this sort of automated metadata recovery is not possible in general for microstructure datasets where the imaging metadata was not preserved. Additionally, reliably using OCR to extract image metadata requires substantial tuning and review. One common error we encountered was substitution of pm or um for scale bar units shown in μm; in this instance, the number of unique results is low enough to manually identify and programmatically correct each type of erroneous reading. These factors highlight the need for ubiquitous and standardized storage of imaging metadata at the point of collection.
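A minimal sketch of this kind of programmatic correction; the exact substitution table depends on manual review of the OCR output for a given dataset:

```python
# Common tesseract misreadings of the "μm" scale bar unit observed in
# this dataset; the set of unique OCR results is small enough to
# enumerate and correct by hand.
UNIT_CORRECTIONS = {"pm": "μm", "um": "μm"}

def correct_units(ocr_text):
    """Normalize an OCR'd scale-bar unit string to a canonical form."""
    token = ocr_text.strip()
    return UNIT_CORRECTIONS.get(token, token)
```

For example, `correct_units(" pm ")` returns `"μm"`, while an already-correct reading passes through unchanged.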
Internally, we use a SQLite database to manage the microstructure metadata and link it to raw image files and numerical microstructure representations. Raw images are stored as plain png and tif files, and numerical microstructure representations are stored in a simple HDF5 format.
The relational structure of the metadata (where multiple micrographs share the same processing metadata) makes this organization a clear choice over common text-based (comma-separated value, json) and binary (HDF5) formats for tabular data by reducing the complexity of code written to query, manipulate, and update the data. The binary data and UHCSDB web application URLs associated with the micrograph records in the SQLite database are organized using the integer primary keys for the Micrograph table (i.e., Micrograph.micrograph_id). For example, the Micrograph.path field stores a relative filesystem path to the corresponding raw png or tif format microstructure image file (micrograph1219.tif for a micrograph with a primary key of 1219). Similarly, the URL http://uhcsdb.materials.cmu.edu/visual_query/1219 requests microstructure-based search results for micrograph 1219.
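A minimal sketch of this organization using Python's built-in sqlite3 module; the `primary_microconstituent` column is illustrative, and the real UHCSDB schema carries additional tables and fields:

```python
import sqlite3

# Minimal sketch of the Micrograph table; the integer primary key ties
# each database record to its raw image file and web application URL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE micrograph (
        micrograph_id INTEGER PRIMARY KEY,
        path TEXT,
        primary_microconstituent TEXT  -- illustrative metadata column
    )
""")
conn.execute(
    "INSERT INTO micrograph VALUES (?, ?, ?)",
    (1219, "micrograph1219.tif", "pearlite"),
)

# Recover the relative filesystem path and derive the query URL
# for micrograph 1219 from its primary key.
path, = conn.execute(
    "SELECT path FROM micrograph WHERE micrograph_id = 1219"
).fetchone()
url = "http://uhcsdb.materials.cmu.edu/visual_query/1219"
```

With this convention, a single integer key resolves both the on-disk image and the corresponding web application endpoint.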
Additional binary data associated with each micrograph (e.g., reduced-dimensionality microstructure representations) are stored in HDF5 format indexed with the integer primary key from the corresponding record in the Micrograph table. Microstructure representations from each method described in  are stored in separate HDF5 files (one HDF5 file for each method). Each vectorial microstructure representation is stored in a dataset named with the corresponding primary key in the Micrograph table: the feature vector for micrograph 1219 is stored in the HDF5 dataset /1219. The reduced-dimensionality microstructure representations are stored in a similar format, except that microstructure representations for each dimensionality reduction technique are organized into HDF5 groups, so that the t-SNE map point for micrograph 1219 is stored in the HDF5 dataset /t-SNE/1219. Parameters specific to each dimensionality reduction algorithm are stored as attributes to the top-level HDF5 group, including the implementation we used (e.g., sklearn.decomposition.PCA for the scikit-learn implementation of principal component analysis) to simplify reproducibility.
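The layout described above can be sketched with h5py; the file names and attribute values here are illustrative assumptions, not the exact UHCSDB files:

```python
import h5py
import numpy as np

# Sketch of the HDF5 layout: one file per representation method, one
# dataset per micrograph (named with its Micrograph primary key), and
# reduced-dimensionality maps organized into groups by technique.
with h5py.File("mvgg5.h5", "w") as f:              # hypothetical file name
    f.create_dataset("1219", data=np.random.rand(512))

with h5py.File("embeddings.h5", "w") as f:         # hypothetical file name
    grp = f.create_group("t-SNE")
    # Algorithm parameters and implementation stored as group attributes
    grp.attrs["implementation"] = "sklearn.manifold.TSNE"  # assumed value
    grp.attrs["perplexity"] = 40.0                         # assumed parameter
    grp.create_dataset("1219", data=np.random.rand(2))

with h5py.File("embeddings.h5", "r") as f:
    point = f["/t-SNE/1219"][...]   # 2D map point for micrograph 1219
```

Storing the implementation and parameters as attributes alongside the data keeps each embedding self-describing for reproducibility.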
The main advantages of a SQL database over the more specialized emerging materials data formats are the simplicity, stability, and ubiquity of plain SQL, and the surrounding ecosystem of related supporting libraries available in many popular programming languages. Specifically, tools like the python libraries sqlalchemy and pandas allow users unfamiliar with database systems to interact with the data by writing plain python code, instead of learning a new database system or query language (e.g., SQL). For the binary data, HDF5 carries the advantage of accessibility to multiple programming languages compared with, e.g., matlab or numpy binary formats, and offers more structure and performance compared with plain text formats. These factors simplify the process of loading microstructure image data and processing metadata for use with analytic and exploratory tools.
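For example, the metadata can be pulled into a DataFrame with a few lines of pandas; the table and column names below are illustrative:

```python
import sqlite3
import pandas as pd

# Sketch: load micrograph metadata into a pandas DataFrame, so further
# filtering and analysis happen in plain Python rather than SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE micrograph (micrograph_id INTEGER PRIMARY KEY, path TEXT)"
)
conn.executemany(
    "INSERT INTO micrograph VALUES (?, ?)",
    [(1218, "micrograph1218.tif"), (1219, "micrograph1219.tif")],
)

df = pd.read_sql("SELECT * FROM micrograph", conn)
subset = df[df.micrograph_id == 1219]  # pandas-side filtering, no SQL needed
```

From here, joins, grouping, and plotting proceed with the usual pandas idioms.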
The most significant disadvantage of using a custom SQL schema is its inflexibility. The SQLite schema presented in this section was designed for expedience in organizing the data for the experiments presented in , and will not generalize to new microstructure datasets with different sets of processing and properties metadata. Moving forward, two general options are available: commit to one of the emerging materials data formats (e.g., Citrine’s PIF [7, 8] or Materials Commons ), or iteratively adapt custom organizations while mapping out the data and infrastructure requirements of microstructure data science applications. As the microstructure community converges on data standards and data infrastructure matures and stabilizes, well-documented custom data formats can readily be converted into standard formats and integrated into community repositories.
Tools for Exploratory Analysis of Microstructure Datasets
The ability to concisely describe, evaluate, and synthesize large bodies of microscopy work performed over an extended period of time in a collaborative environment is a challenge of long-term, large-scale materials research projects. Often, the institutional memory surrounding microstructure data depends strongly on the humans involved. Even where data is stored digitally, it is typically inaccessible for automated analysis, and it may be difficult for humans to locate and discover specific pieces of data. The global image representations discussed in  enable several novel ways of interacting with and exploring microstructural image datasets, helping collaborative research efforts scale up. High-dimensional nearest neighbor search can rank micrographs by some measure of microstructural similarity (“Microstructure Query Tool”). Dimensionality reduction algorithms can also be applied to display thumbnail images (“Offline t-SNE Image Maps”) or processing/properties metadata on a spatial microstructure map (“Interactive Metadata Visualization Tool”).
We built a simple microstructure-oriented responsive web application using open source tools, allowing users to interactively explore the UHCS microstructures and metadata via our dataset exploration tools. Such a web application can easily be deployed locally on a personal machine or local network for internal use, or on the public internet via appropriate infrastructure (web server, hosting, domain name registration, etc.). The UHCSDB web application is currently available at http://uhcsdb.materials.cmu.edu. See “Archival Data Accessibility” for details on accessing the full microstructure dataset along with source code for the data visualization tools.
Microstructure Query Tool
The primary interface of the UHCSDB web application is the microstructure query tool which, given a micrograph, conducts a nearest neighbor search for images in the dataset with similar microstructural content. The nearest neighbor search can operate on any suitable microstructure representation; here we use the multiscale convolutional neural network (CNN) representation described in . CNNs compose multiple layers (as many as one hundred layers in some modern architectures) of convolution filters to extract highly abstract image representations useful for a variety of visual, spatial, and auditory tasks. Our CNN representations are constructed from the internal activations of a 13-layer CNN trained to perform an object recognition task. We combine activations from multiple scales in the input images to obtain representations that are more robust with respect to changes in magnification.
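The back end of such a query tool can be sketched as a brute-force nearest neighbor search over precomputed feature vectors; random vectors stand in here for the real multiscale CNN representations:

```python
import numpy as np

# Sketch of the microstructure query: rank all other micrographs by
# Euclidean distance between their precomputed feature vectors.
rng = np.random.default_rng(0)
keys = [1217, 1218, 1219, 1220]
features = {k: rng.random(512) for k in keys}  # stand-in CNN features

def visual_query(query_key, features, n_results=3):
    """Return the n_results nearest micrograph keys (excluding the query)."""
    q = features[query_key]
    others = [k for k in features if k != query_key]
    dists = {k: np.linalg.norm(features[k] - q) for k in others}
    return sorted(others, key=dists.get)[:n_results]

neighbors = visual_query(1219, features)
```

For larger datasets, the same ranking can be served by an approximate nearest neighbor index rather than an exhaustive scan.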
Offline t-SNE Image Maps
The dimensionality-reduction techniques commonly used in materials data science (e.g., principal component analysis) sometimes fail to adequately represent the structure of complex, noisy, and potentially nonlinear real-world data distributions, such as the CNN representations for the present UHCS micrographs. While we explored multiple dimensionality reduction techniques (see “Interactive Metadata Visualization Tool”), we found that t-SNE (t-distributed stochastic neighbor embedding)  consistently yields high-quality data visualizations for the UHCS microstructure data. The t-SNE algorithm  yields a low-dimensional representation by using a stochastic optimization procedure that attempts to conserve the local neighborhood structure of the high-dimensional data, rather than the global structure as in PCA. The ability of t-SNE to reveal local structure in high-dimensional data comes at the cost of a distorted depiction of larger distances: t-SNE heavily penalizes large low-dimensional distances between pairs of map points that have small high-dimensional distances, but effectively ignores pairs of points with large high-dimensional distances. Thus, large distances between pairs of data points in the low-dimensional maps produced by t-SNE carry little significance. Despite this compromise, t-SNE often provides interesting and visually useful depictions of real-world high-dimensional datasets, often in settings where linear techniques break down.
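As a sketch, scikit-learn's TSNE can stand in for the reference implementation used in this work; random vectors substitute for the real CNN features:

```python
import numpy as np
from sklearn.manifold import TSNE

# Sketch: embed high-dimensional feature vectors into a 2D map with
# t-SNE. Perplexity controls the effective neighborhood size whose
# local structure the optimization tries to preserve.
rng = np.random.default_rng(0)
features = rng.random((50, 64))  # 50 stand-in micrographs, 64-d features

tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
embedding = tsne.fit_transform(features)  # one 2D map point per micrograph
```

Each row of `embedding` becomes the map coordinate at which the corresponding micrograph thumbnail is drawn.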
The microstructure map in Fig. 4 shows some of the key microstructures in the UHCS dataset; the inset scatter plot shows the full t-SNE map with the selected view indicated by the black frame. Colors indicate the primary microconstituent labels. The image map is best viewed electronically in its complete form, available in the supplemental materials along with maps for additional microstructure representations. This microstructure map was obtained by applying t-SNE to multiscale fifth block CNN features as described in detail as mVGG5 features in .
The main focus of this microstructure map view is the initial pearlitic structure. Starting from the bottom right of the map and moving upwards, the pearlite structure changes from high-magnification images of fine pearlite to lower-magnification views of coarser, more complex pearlite structures. On the left half of the map, there is a cluster of pearlitic microstructures containing spheroidized cementite, and at the top left corner of this view there are several micrographs with extensive Widmanstätten cementite. These microstructure maps are useful for summarizing large bodies of characterization work collected over potentially long time frames in a clear and concise manner.
The bottleneck in this process (for the relatively small UHCS dataset, at least) is the dimensionality reduction step. Because we have precomputed the reduced-dimensionality representations, these microstructure maps are easy to generate on demand, subject to the practical constraints of sufficient server-side resources or a client-side implementation.
Interactive Metadata Visualization Tool
As shown in Fig. 5d, placing the mouse cursor over a scatterplot marker will display a thumbnail image of the corresponding micrograph, along with some relevant processing metadata. Clicking on a marker will open a URL that triggers a microstructure query for the corresponding micrograph record. This view is not as complete as the microstructure maps presented in “Offline t-SNE Image Maps”, but allows the user to explore relationships between microstructure and metadata in a more interactive manner.
Currently, the (precomputed) reduced-dimensionality methods available on the UHCSDB web application include principal component analysis (PCA), t-SNE, multidimensional scaling (MDS; shown in Fig. 5d), locally linear embedding (LLE), Isomap, and spectral embedding. At present, we precompute reduced-dimensionality representations for each microstructure representation, but in principle they can be computed on demand. Most of these dimensionality reduction methods require only a few seconds of computation for a dataset of this size and complexity, though t-SNE can require a few minutes, especially when computing multiple independent maps. Front-end (client-side) dimensionality reduction implementations may be useful for exploratory and collaborative deployments, compared with the precomputed dimensionality reduction workflow we currently employ.
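The precomputation step can be sketched by applying several scikit-learn reducers to the same feature matrix and storing the resulting 2D maps keyed by method name; random data stands in for the microstructure representations, and the method set below is a subset of those listed above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap

# Sketch: compute one 2D embedding per dimensionality reduction method
# for the interactive metadata visualization to draw from.
rng = np.random.default_rng(0)
features = rng.random((40, 32))  # stand-in microstructure representations

methods = {
    "PCA": PCA(n_components=2),
    "MDS": MDS(n_components=2, random_state=0),
    "Isomap": Isomap(n_components=2, n_neighbors=5),
}
embeddings = {name: m.fit_transform(features) for name, m in methods.items()}
```

In the UHCSDB workflow, each entry of `embeddings` would then be written to its own HDF5 group for the web application to serve.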
The UHCS microstructures and metadata can serve the materials data science community as a source of benchmark tasks for evaluating and comparing microstructure representations. Though the dataset is small compared to many of the standard datasets used in the computer vision and robotics literature, its size is likely representative of microstructure datasets currently collected by individual researchers.
Microstructural complexity makes the UHCS dataset an interesting challenge for microstructure segmentation and representation. As in many important microstructure systems, the relevant microstructure features (especially spheroidized cementite in this case) vary in physical length scale as well as in the relative length scale of the image reference frame (i.e. differing magnification). These aspects make the UHCS dataset a useful resource for researchers interested in developing microstructure representations that are invariant, equivariant, or covariant to both scale and rotation. Similarly, the UHCS dataset is a promising resource for materials data scientists to build teaching modules for microstructure informatics techniques.
We are also currently using the microstructure dataset to explore application of semantic segmentation techniques to complex microstructure systems, which could support and accelerate conventional microstructure-based research. For example, Hecht et al.  used a laborious semi-automated process to segment the spheroidized cementite particles (e.g., in Fig. 1c, Fig. 1 in ) to enable their analysis of the cementite coarsening kinetics. They also manually traced branches of the proeutectoid cementite network (e.g., in Fig. 1b, Fig. 3 in ) and the surrounding particle-free denuded regions to support their particle coarsening analysis. Automation of these kinds of microstructure analytics tasks could significantly lower the cost of gathering and analyzing statistically meaningful quantities of microstructure data.
Finally, the web application and exploratory analysis tools presented in this manuscript can be repurposed and adapted for analysis of other microstructure datasets. These tools could help individual research groups scale up their analysis and interpretation of microstructure data internally. Additionally, combined with emerging data curation platforms such as Citrination , Materials Commons , and the NIST DSpace , these microstructure dataset visualization tools could impact the way researchers interact with the materials science literature, as is being done for numerical materials properties [7, 19, 20]. What if every materials characterization paper had an interactive microstructure and metadata supplementary publication, instead of merely including a select few “representative” micrographs? Integration of microstructure-based search and visualization tools into the materials data infrastructure could significantly improve discoverability.
Archival Data Accessibility
The complete set of micrographs, metadata, and web application source code is available on the NIST repository materialsdata.nist.gov.
Software dependencies used in this project:
- Database management system: SQLite
- General purpose programming: Python
- Reference t-SNE implementation

Python dependencies not included in the Python standard library:
- Interactive data visualizations
- Neural network library: Keras
- Numerical computing library: NumPy
- Data frames library: pandas
- Image processing library: scikit-image
- Statistical plotting library
- Machine learning library: scikit-learn
We present an ultrahigh carbon steel microstructure dataset and a suite of microstructure visualization tools. The UHCS dataset is a promising community resource for researchers interested in developing data-driven methods linking microstructure with processing/properties metadata. The dataset is also ideal for the creation of microstructure data science teaching resources to enable workforce development. We hope the microstructure and metadata visualization tools we present will be integrated into the burgeoning ecosystem of materials data repositories to increase the discoverability of microscopy datasets. Finally, these tools may help large collaborative projects scale up and speed up the microstructure collection, curation, and analysis components of their work.
We gratefully acknowledge funding for this work through National Science Foundation grants DMR-1307138 and DMR-1501830, and through the John and Claire Bertucci Foundation. Data visualization tool development by B.D., T.F., and E.H.; UHCS microscopy work by M.H., Y.P., and B.W. This work was supported in part by the Commonwealth of Pennsylvania Department of Community and Economic Development (DCED) Developed in PA program (D2PA), and by National Science Foundation grant CMMI-1436064. The as-cast and heat treated UHCS samples were provided by Miller Centrifugal Casting.
- 1. Hecht MD, DeCost BL, Francis T, Holm EA, Picard YN, Webler BA. Ultrahigh carbon steel micrographs. https://hdl.handle.net/11256/940
- 3. Hecht MD, Picard YN, Webler BA (2017) Coarsening of inter- and intra-granular proeutectoid cementite in an initially pearlitic 2C-4Cr ultrahigh carbon steel. Metall Mater Trans A 48(5):2320–2335
- 4. DeCost BL, Francis T, Holm EA (2017) Exploring the microstructure manifold: image texture representations applied to ultrahigh carbon steel microstructures. Acta Materialia
- 5. DeCost BL, Francis T, Holm EA. UHCSDB microstructure explorer. http://uhcsdb.materials.cmu.edu. Accessed 14 April 2017
- 6. Smith R (2007) An overview of the Tesseract OCR engine. In: 9th International Conference on Document Analysis and Recognition (ICDAR 2007), vol 2. IEEE, pp 629–633
- 11. Van der Maaten L, Postma E, Van den Herik J (2009) Dimensionality reduction: a comparative review. J Mach Learn Res 10:66–71
- 12. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(Nov):2579–2605
- 13. Jolliffe IT (2002) Principal component analysis, 2nd edn. Springer, New York
- 18. NIST repositories. https://materialsdata.nist.gov. Accessed 14 April 2017
- 21. Chollet F (2015) Keras. https://github.com/fchollet/keras. Accessed 22 May 2017
- 22. Van der Walt S, Schönberger JL, Nunez-Iglesias J, Boulogne F, Warner JD, Yager N, Gouillart E, Yu T (2014) scikit-image: image processing in Python. PeerJ 2:e453
- 23. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830