Introduction

Reliance on shared digital data in scientific and engineering pursuits - whether the data are derived from computation or experiment - is becoming more commonplace within the materials science and engineering (MSE) community. Concurrently, government policies across the globe are embracing an 'open science' model which sets a requirement for sharing digital data generated from publicly funded research. A recent joint Materials Research Society and The Minerals, Metals and Materials Society (MRS-TMS) survey on 'big data' in materials science and engineering showed that 74% of respondents would be willing to share their data if doing so were encouraged as a term and condition of funding or publishing, assuming the proper safeguards were in place [1]. However, it is fair to say that the MSE community currently lacks the strategy, framework, standards, and culture needed to support materials data curation and sharing. A unified approach is needed to meet both the growing demands of the community and government requirements for broad access to digital data. The peer-reviewed journals and government-sponsored technical reports serving the MSE community can clearly be an essential component of the solution, and there is now an opportunity to proactively plan how they may best serve the growing needs of their constituency.

We have structured this paper to, first, provide the reader a general awareness of the global environment and ongoing activities concerning the management of research data. We then present a perspective on the benefits of data archiving to the MSE community and outline what attributes and characteristics a data archiving solution should have. We follow this with a discussion of key challenges still facing the establishment of a digital materials data infrastructure. Finally, we propose a way forward to tackle the creation of community-based solutions for data archiving policies and data repositories.

Review

Global context

The 2008 NRC report on Integrated Computational Materials Engineering (ICME) highlighted the important role digital data will play in the future of materials science and engineering [2]. MSE's ever-increasing reliance on computational modeling and simulation will demand digital data as the feedstock for solutions in both science and engineering.

In the USA, the National Institutes of Health have long promoted a policy of open access to data generated from their grants [3]. In the mid-1990s, the Human Genome Initiative spawned the Bermuda Principles which called for immediate public posting of sequences of the human genome [4]. More recently, the National Science Foundation has adopted a requirement that applicants provide a data management plan in grant proposals [5]. Specific to the materials community, the sharing of digital data is a key strategy component of the US's Materials Genome Initiative (MGI), and mechanisms to foster and enable sharing are actively under consideration [6].

The European Union has been very proactive in studying the impacts of a digitally linked world on the scientific community. The EU Framework Programme 7 funded a project called Opportunities for Data Exchange that has produced several relevant reports on publishing digital data in the scientific community [7]. And, in June 2012, the Royal Society published 'Science as an open enterprise', which promotes free and open access to scientific results, including data [8]. These studies are now broadly informing government policy. For example, policy issued in the UK in July 2012 calls for government-funded research to be published in open access journals and requires access to supporting research data [9]. In February 2013, Dr. John Holdren, director of the Office of Science and Technology Policy (OSTP), issued a directive to all federal agencies to develop plans to make the results of federally funded research more accessible to the public [10]. A key component of this directive is a call for agency plans to include a means by which digital data resulting from research can be made available to the public. In support of this policy, the White House has established a useful web site providing resources supporting the establishment of open data [11]. The US Government funding agencies have since submitted their plans for addressing OSTP's open research policy, and the results are imminent.

Other technical communities have addressed the challenges of access to digital data with a variety of approaches. Indeed, the biology community has implemented a number of differing approaches; for example, the approach taken in genetics versus that adopted by evolutionary biology [12],[13]. In other disciplines, one subfield of thermodynamics has already adopted a very structured approach to archiving data, while the earth sciences community has embarked on an effort to define its approach [14],[15]. The astronomy community has dedicated international resources to the development of the Virtual Observatory, an infrastructure that enables global data discovery and access across hundreds of distributed archives [16]. Despite the differing mechanics of implementation, all the approaches were rooted in a community-led effort to define the path best suited for that particular technical field.

In response to these trends, technical communities and publishers have developed and implemented open access journals and data archiving policies. Again, the field of biology appears to be leading the way on both these fronts. One example of this trend is Database: The Journal of Biological Databases and Curation, an open access journal dedicated to the discussion of digital data in biology [17]. And in a recent development, Nature Publishing Group has launched a new open access journal, titled Scientific Data, which is dedicated to publishing descriptions of scientific datasets and their acquisition [18]. It will initially focus on life, biomedical, and environmental science communities. Furthermore, the Public Library of Science recently strengthened its policy on data access: 'PLoS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception' [19].

In order to begin a dialog within the MSE community, the National Institute of Standards and Technology (NIST) convened a workshop on digital materials data in May of 2012 under the auspices of MGI. The workshop identified a number of barriers that need to be addressed in creating a data strategy for materials; these include materials schema/ontology, data and metadata standards, data repositories/archives, data quality, incentives for data sharing, intellectual property, and tools for finding data [20]. Other disciplines, notably evolutionary biology, have demonstrated that peer-reviewed journals have the potential to contribute solutions to these barriers to data sharing [21].

Benefits of archiving materials science and engineering data

There is a growing realization within the global scientific community that the data generated in the course of research are an oft-overlooked asset with considerable residual value to other scientists and engineers and that, often, a significant portion of the data are stored but not accessible. The following are several anticipated benefits of increasing access to materials science and engineering data in digital form:

Data reuse

Scientific productivity and return on investment in research infrastructure.

Secondary hypothesis testing.

Avoiding paying for the generation of the same data multiple times.

Comparing with previous studies.

Integrating with previous and future work.

Reproducing and checking analyses.

Simplifying and enhancing subsequent systematic reviews and meta-analyses.

Facilitating interdisciplinary research.

Teaching.

Incentives

Increasing academic credit (citations).

Access to one's own data at a future date.

Convenience and security of cloud storage.

Others

Testing algorithms/computations with validated reference datasets.

Meeting funding agency requirements to share data.

Reducing the potential for duplication of effort.

Reducing error and fraud.

The MRS-TMS 'big data' survey asked participants to evaluate whether given attributes would act as impediments or motivators to sharing data (Figure 1) [1]. The bottom of the graph shows that the largest impediments are primarily driven by legal considerations. The top of the graph demonstrates that the strongest positive motivators are the increased attention and credit a researcher may draw for their work. It is clear that widespread sharing of digital materials data will require not only technological advances but also cultural shifts, including modernization of the traditional incentives for sharing scholarly works to encompass recognition for publication of data. While there is no universally accepted solution available at present, new tools such as the Thomson Reuters Data Citation Index may help provide avenues for this recognition [22].

Figure 1

MRS-TMS big data survey result. Responses to the question: 'Do you consider the following items to be impediments or motivation for you to share your data with the world?' The abscissa depicts the response rate to the three choices: impediment, neutral, or motivation.

The impact on research productivity owing to the provision of well-calibrated, well-documented, archival data products is clearly demonstrated in the case of NASA's Hubble Space Telescope (HST). Initially, archival data were not used very extensively; the early data suffered from the telescope's spherical aberration, resulting in a factor of approximately ten decrease in sensitivity from expectations. In the early 1990s, there was also somewhat of a stigma attached to using archival data for research: it was somehow not as good or pure as collecting one's own data at a telescope. But times have changed, and HST archival data are now used extensively by astronomers unaffiliated with the teams making the original observations. The resulting research papers account for more than half of all peer-reviewed publications based on HST observations (see Figure 2). There are a number of reasons for the big increase in archival data use. HST observing time is very difficult to get, with typically a seven-to-one oversubscription ratio in the proposal process. All HST data are routinely pipeline processed, yielding an archive of 'science ready' data products. All HST data become public after a nominal 12-month embargo period. And HST data taken for one purpose can often be utilized for studies of a substantially different intent. While this high level of reuse may not be achieved for all research experiments, the HST example clearly shows that a substantial improvement in research productivity can be achieved, at a very modest incremental cost, when proper care is taken in designing the data management system.

Figure 2

Use and reuse of archived data from the Hubble Space Telescope. HST data are used approximately twice as often in research papers written by scientists with no connection to the original investigators proposing the research. This more than doubles the productivity of HST at a marginal extra cost of providing well-calibrated data in an easy-to-access archive.

In materials science, there is a strong case that data obtained at great public expense should be made available to as large a group of researchers as is practical, noting that, unlike astronomical data, there can be important constraints due to both national security and intellectual property concerns. Acknowledging these issues, however, materials data obtained from national user facilities, such as data obtained by scattering of synchrotron light or neutron beams, are examples of a scarce yet valuable information stream. Indeed, these facilities have recognized the importance of good archiving capability but have not focused on distribution of these data. Making such information more widely available would increase the amount of materials knowledge discovery with a small investment relative to the costs of repeating the work.

Background for data archiving

A common approach to archiving materials data has several benefits, but its primary value would be to provide unified, consistent guidance and expectations throughout the scientific and engineering community. However, while the development of an archiving policy itself may be relatively straightforward, the infrastructural issues necessary to support policy implementation are extraordinarily complex. These issues include the establishment of viable

Repositories for materials data.

Standards for data exchange.

Citation and attribution protocols.

Data quality metrics.

Intellectual property and liability determinations.

Characteristics of an archiving solution

In order for a data archiving solution to be of lasting value to researchers and maintain the rigorous, archival standards of relevant publications, it should have the following minimum set of characteristics:

Persistent citation.

Data discoverability.

Open access (for journals).

Ease of use.

Minimal cost.

To archive or not to archive?

The most critical question to be answered in setting policies for publications is 'what data should be archived?' The answer is essential in providing clear expectations for authors, editors, and reviewers, as well as determining the size of the data repositories needed. Other disciplines have already embarked on this journey and have devised a variety of approaches that suit the data needs of their communities for their stage of 'digital maturity.' Two ends of the spectrum in addressing this question are presented here. The first assumes all data supporting a publication are worthy of archiving. This criterion is found most often in peer-reviewed journals that have narrow technical scope and generally deal with very limited data types. For example, journals in crystallography and fluid thermodynamics have very stringent data archiving policies that prescribe formats and specific repositories for the data submitted [23],[24]. Other journals that cover broader technical scope, and therefore deal with more heterogeneous data, have implemented more subjective criteria for data archiving and a distributed repository philosophy. Earth sciences and evolutionary biology have typically taken this approach. It is likely that the approach adopted by MSE publications may also span a similar spectrum, depending on the scope of the publication.

The MRS-TMS 'big data' survey provided insight into the community's perspective on the relative value of access to various types of materials data, shown in Figure 3. It is interesting to note that as the complexity of the data and metadata (generally) increases toward the right-hand side of the chart, the community's perceived need to have access to these data decreases. This could be due to many factors, including the difficulty in assuring the quality of such data as well as the lack of familiarity with tools to handle the data complexity. However, with complexity comes a richness of information that, if properly tapped, could be extraordinarily valuable. In astronomy, for example, the Sloan Digital Sky Survey (SDSS) created a very complex database of attributes of stars, galaxies, and quasars. The wealth of information and immense discovery potential led many in the research community to become expert users of SQL and the survey to yield nearly 6,000 peer-reviewed publications.a

Figure 3

MRS-TMS big data survey result. Responses to the question: 'What scientific/technical databases and data mining tools would be most useful if they could be created?'.
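To make the value of such structured access concrete, the following minimal sketch shows the style of SQL query that SDSS users run, transposed to a hypothetical materials table. The schema, column names, and records are invented for illustration and do not correspond to any existing repository.

```python
import sqlite3

# Toy, hypothetical schema: a structured table of tensile-test records,
# loosely analogous to how SDSS exposes star/galaxy attributes via SQL.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE tensile_tests (
        specimen_id   TEXT,
        alloy         TEXT,
        temperature_K REAL,
        yield_MPa     REAL,
        uts_MPa       REAL,
        doi           TEXT  -- persistent identifier of the source dataset
    )
""")
con.executemany(
    "INSERT INTO tensile_tests VALUES (?, ?, ?, ?, ?, ?)",
    [("A1", "Ti-6Al-4V", 293.0, 880.0, 950.0, "10.xxxx/example.1"),
     ("A2", "Ti-6Al-4V", 600.0, 640.0, 730.0, "10.xxxx/example.2"),
     ("B1", "IN718", 293.0, 1030.0, 1240.0, "10.xxxx/example.3")],
)

# A structured query - find every near-room-temperature record for one
# alloy - of the kind that is essentially impossible to run against a
# pile of unstructured supplementary files.
for row in con.execute(
    "SELECT specimen_id, yield_MPa, doi FROM tensile_tests "
    "WHERE alloy = ? AND temperature_K < 350 ORDER BY yield_MPa DESC",
    ("Ti-6Al-4V",),
):
    print(row)
```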

For those publications with wide technical scope, it will be difficult to provide a universal answer to 'what data should be archived?' In these cases, the decision for what data to archive may best be left to the judgment of the authors, peer reviewers, and editors. A particularly useful metric might be the cost/effort to produce the data. For example, the 'exquisite' experimental data associated with a high-energy diffraction microscopy experiment provide unique, expensive, and rich datasets with great potential use to other researchers [25]. Clearly, based on these factors, such a dataset should be archived. On the other hand, the results from a model run on commercial software that takes 5 min of desktop computation time may not be worthy of archiving as long as the input data, boundary conditions, and software version are well defined in the manuscript. Of course, one must account for the perishable nature of code, particularly old versions of commercial code. However, even the data from the common tensile test may be worthy of archiving, as publications do not typically provide the entire curve; while the paper may report only yield strength, another researcher may be interested in work hardening behavior. Having the complete dataset in hand allows another researcher to explore alternative facets of the material's behavior. The basic elements of the criteria for determining the data required for archiving could include the following:

Are the data central to the main scientific conclusions of the paper?

Are the data likely to be usable by other scientists working in the field?

Are the data described with sufficient pedigree and provenance that other scientists can reuse them in their proper context?

Is the cost of reproducing the dataset substantially larger than the cost of archiving the fully curated dataset?

Is the dataset reproducible at all, or does it stem from a unique event or experiment?

Data itself can come in a variety of 'processed' levels including 'raw', 'cleaned', and 'analyzed'. Such characterizations are subjective, though some disciplines have adopted quite rigorous definitions. Nonetheless, given the diversity of materials data, care will need to be taken in determining the appropriate amount of processing performed on a dataset to be archived. While raw or cleaned data is much preferred for its relative simplicity in reuse, it is probably much more important at this stage of our digital maturity that the metadata accompanying the dataset provide sufficient pedigree and provenance to make the data useful to others, including definition of the post-acquisition (experiment or computation) processing performed.

Another factor to consider in setting the guidelines for which data need to be archived is the expected annual and continuing storage capacity required. A very informal survey of 15 peer-reviewed journal article authors at NIST and the Air Force Research Laboratory (AFRL) found that most articles in the survey had less than 2 GB of supporting data per paper. Currently, the time and resources required to upload (by authors) and download (by users) data files less than 2 GB are quite reasonable. However, those papers reporting on emerging characterization techniques such as 3D serial sectioning and high-energy diffraction microscopy were dependent on considerably larger datasets, approximately 500 GB per paper. Other disciplines have established data repositories to support their technical journals. Experience to date indicates that datasets of up to approximately 10 GB can be efficiently and cost effectively curated. Repositories such as the Dryad digital repository show that datasets of this magnitude can be indefinitely stored at a cost of $80 or less [13]. However, datasets approaching 500 GB will very likely require a different approach for storage and access. Thus, a data repository strategy needs to account for this range of dataset sizes. An additional factor when considering long-term storage requirements is the high global rate of growth in materials science and engineering publications. Figure 4 shows the dramatic growth in the number of MSE journal articles published over the past two decades, implying a commensurate growth in accompanying data.

Figure 4

Number of materials science and engineering publications per year. Source: Thomson Reuters Web of Knowledge.
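To frame what such growth implies for repository capacity, here is a back-of-envelope sizing sketch. It combines the per-paper figures quoted above (under 2 GB for typical papers, roughly 500 GB for emerging techniques); the annual article count and the share of large-data papers are illustrative assumptions, not survey results.

```python
# Back-of-envelope repository sizing from the figures quoted in the text:
# most papers carry < 2 GB of supporting data, while emerging techniques
# (3D serial sectioning, high-energy diffraction microscopy) carry ~500 GB.
papers_per_year = 50_000   # ASSUMED annual MSE article volume, illustrative
typical_gb = 2             # from the informal NIST/AFRL author survey
large_gb = 500             # emerging characterization techniques
large_fraction = 0.01      # ASSUMED share of large-data papers

typical_tb = papers_per_year * (1 - large_fraction) * typical_gb / 1000
large_tb = papers_per_year * large_fraction * large_gb / 1000
print(f"typical papers: {typical_tb:,.0f} TB/yr")   # ~99 TB/yr
print(f"large-data papers: {large_tb:,.0f} TB/yr")  # ~250 TB/yr
# Even with only 1% of papers in the 500 GB class, the large datasets
# dominate the total, which is why a single storage approach is unlikely
# to serve the whole distribution.
```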

Data repositories

Aside from crystallographic data repositories, there are at this time perhaps no dedicated materials data repositories that meet the required characteristics defined above. The materials science and engineering community does have numerous publicly accessible data repositories; however, the majority of these are associated with specific projects or research groups, and their persistence is therefore dependent on individual funding decisions. These repositories are primarily established to house and share the research data generated within a specific project or program. They generally neither follow uniform standards for data and metadata nor provide for data discoverability and citation. There are very few repositories established with the explicit objective of providing MSE with public repositories for accessible digital data. In short, publicly accessible, built-for-purpose repositories and the associated infrastructure for access, safe storage, and management still need to be developed and sustainably funded - this is the largest impediment to implementing viable data archiving policies. (See, for example, 'sustaining domain repositories for digital data: a white paper' [26]). In establishing their Joint Data Archiving Policy, journals in evolutionary biology did not prescribe specific repositories; instead, they allowed a mix of repositories to be used by authors as long as they met established criteria. Such criteria may be as simple as requiring cited data to be permanently archived in data repositories that meet the following conditions (a minimal programmatic check of the identifier condition is sketched after the list):

Publicly accessible throughout the world.

Committed to archiving data sets indefinitely.

Allow bi-directional linking between paper and dataset.

Provide a persistent digital identifier.
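As a minimal illustration of the last condition, the sketch below checks whether a dataset's persistent identifier resolves to citable metadata. It assumes the public DataCite REST API at https://api.datacite.org/dois/{doi}; the example DOI and the specific fields checked are placeholders chosen for illustration.

```python
import json
import urllib.request

def doi_resolves(doi: str) -> bool:
    """Return True if the DOI resolves to a well-formed metadata record.

    Assumes the DataCite REST API; a 404 or network failure yields False.
    """
    url = f"https://api.datacite.org/dois/{doi}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            meta = json.load(resp)["data"]["attributes"]
    except Exception:
        return False
    # A citable record should at least name a title, publisher, and year -
    # the anchors a bi-directional paper<->dataset link hangs off of.
    return all(meta.get(k) for k in ("titles", "publisher", "publicationYear"))

print(doi_resolves("10.5061/dryad.example"))  # placeholder, not a real dataset
```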

One tempting option might be to take advantage of the online storage capability several journals already offer for supplementary materials accompanying journal articles. However, as presently constructed, these are not amenable to best practices for dataset storage: they generally are not independently discoverable, searchable, separately citable, or aggregated in one location. In fact, some publishers are reducing or eliminating supplementary file storage due to the haphazard structure and rules associated with its use. Further, new global government policies promoting open access to research works have put the publishing industry in a state of flux with regard to its long-standing, subscription-based business model. Publishers have been reluctant to take on a data archiving responsibility given the economic uncertainties in the publishing marketplace.b Also, there is a possibility that some for-profit publishers could try to restrict access to digital data assets that are co-located with the journal.

As alluded to in the previous section, a fundamental consideration in repository design and/or selection is the level to which the repository will present structured versus unstructured data. Structured technical databases tend to be more useful to a technical community due to their uniformity, as evidenced by their data reuse rate.c Ideally, the vast majority of materials data would reside within structured repositories. A disciplined data structure provides enormous advantages to the researcher in terms of both data discoverability and confidence in its use. However, this structure must be enabled by the application of broader and deeper standards for data and metadata, standards that do not currently exist.

In all likelihood, as in biology, MSE publications will be dependent on a collection of repositories that are tailored to specific materials data. For example, NIST is building and demonstrating a data file repository for CALPHAD (Calculation of Phase Diagrams) and interatomic potentials [27]. These may be expandable and largely sufficient for thematic publications such as those devoted to thermodynamics and diffusion. However, repositories such as this will fill only a relatively small niche need in MSE. Integrating Materials and Manufacturing Innovation is piloting an effort to link articles with their supporting data using the NIST repository according to the criteria outlined above; an example can be found in an article by Shade et al. [28],[29].

Finally, a business model for sustainably archiving materials data is required. Other technical fields, such as earth sciences, can at least partially rely on government-provided repositories for large and complex datasets. Without these types of repositories to build on, MSE will need to establish viable repository solutions. In response to funding-agency requirements for data management plans, some universities, Johns Hopkins for example, are beginning to provide centrally hosted data repositories, but these are not yet common [30]. Private fee-for-service repository services, such as LabArchives and Figshare, are also evolving to meet growing demand for accessible data storage [31],[32]d. Additionally, ASM International is working to create a prototype materials data repository through its close association with Granta Design. Termed the Computational Materials Data Network (CMDN), this is a promising option as the data repository will provide a structured database specifically for materials data, but the business model for CMDN has not yet been solidified [33]. A key open question remains how funding agencies will respond to the OSTP open research policy memo and how they will fund activities making data open to the public.

Standards enabling data discoverability, exchange, and reuse

As noted in the previous section, standards for data and metadata provide the basis for a structured data archive, enabling the rapid discovery of data and assisting in determining the data's relevance and usefulness. At the most basic level, good data practice generally requires the generation, and acceptance, of a vocabulary defining the terms used to describe reported data. This assures data users that they precisely understand the context of the data they are reviewing. From this level, other attributes, features, or requirements can be levied on a data management system, including ontologies, schema, and formats [34].
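As a concrete, though entirely hypothetical, illustration of such a vocabulary, the sketch below shows a minimal metadata record. Every field name is an assumption made for illustration; none is drawn from an existing MSE standard.

```python
import json

# A minimal, hypothetical metadata record. The point is not the field names
# themselves but that a shared vocabulary must pin down exactly these kinds
# of terms (property names, units, pedigree, provenance) before schemas and
# ontologies can be layered on top.
record = {
    "material": {"designation": "Ti-6Al-4V", "condition": "mill annealed"},
    "measurement": {
        "property": "yield_strength",
        "value": 880.0,
        "unit": "MPa",               # agreed units are part of the vocabulary
        "method": "ASTM E8 tension test",
    },
    "pedigree": {"source": "specimen lot 42"},             # where it came from
    "provenance": {"instrument": "servo-hydraulic frame",  # how it was generated
                   "strain_rate_per_s": 1e-3},
    "identifier": {"doi": "10.xxxx/example-dataset"},      # placeholder DOI
}

print(json.dumps(record, indent=2))  # serializable, exchange-ready form
```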

Other fields have studied these issues as a community, and MSE is now reinvigorating a concerted effort to define its approach to setting data standards. Serious efforts to address standards for materials data, particularly structural materials data, were undertaken as long ago as 30 years [35]. In 1985, ASTM International established Committee E49 on Computerization of Material Property Data to develop standard guidelines and practices for materials databases [36]. ASTM International devoted quite substantial effort over a decade and issued data standards relevant to materials through the 1990s, specifically addressing key issues such as how to describe materials, how to record data, data quality indicators, harmonization of terminology, and guidelines for building and distributing databases. The standards have since been withdrawn, but those such as ASTM E1314-89, 'Practice for Structuring Terminological Records Relating to Computerized Test Reporting and Materials Design Formats', are clearly needed in today's environment - though in a more web-enabled format [37]. ASTM International has been reviving its efforts to provide guidance on the digitization of materials test data by exploring the re-establishment of its computerization and networking of materials databases symposium series [38]. The European Union is studying the creation of standards for the exchange of engineering materials data through the European Committee for Standardization [39]. The target for these standards is structural materials, with an early emphasis on aerospace applications. And the European Commission is funding a broader activity called the Integrated Computational Materials Engineering expert group (ICMEg) with the aim of developing the standards and protocols needed to support the digital exchange of materials data needed to conduct ICME [40]. Several recent papers have proposed standards for other types of materials data, including thermodynamic and image-based data [41],[42]. There are also closed-loop approaches to materials data standardization that exist within commercial data management software packages, but these are not generally available to the public.

While the field of information technology is continuously evolving to provide solutions for using unstructured data more productively, at present there is no community-wide accepted practice for MSE data and metadata standards. Near-term solutions for governing the archiving of materials data will need to be relatively loose, flexible, and evolutionary, with a drive toward more standardization. While publishers may not be able to directly provide data repository services, they are well positioned and willing to aid the community in the establishment of data standards. In the pursuit of standardization across a technical field, Michael Whitlock, a primary champion of journal data archiving in the field of evolutionary biology, offered this quote from Voltaire based on his experience: 'the perfect is the enemy of the good'.e It is perhaps much more important at this stage of our digital maturity that MSE first implement data archiving with the best guidance available and work to build in standardization over time.

Data citation and attribution

Well-developed and uniform data citation standards are required to ensure that linkages between publications and datasets are enduring and that creators of digital datasets receive appropriate credit when their data are used by others. Standards for data citation practices and implementation provide the mechanism by which digital datasets can be reliably discovered and retrieved. Closely related to data citation, other challenges include the ability to reliably identify, locate, access, interpret, and verify the version, integrity, and provenance of digital datasets [43]. Any data archiving policy must concern itself not only with how publications should appropriately cite the datasets used but must also require attribution to authors of datasets outside the document.

Numerous organizations in the EU and USA have studied this issue and are continuing to refine technology solutions and best practices. For example, CODATA and the National Academy of Sciences released an in-depth international study and recommendations on citation of technical data [44]. Recently, these transnational initiatives have coalesced to produce a unified Joint Declaration of Data Citation Principles that is appropriate for any type of technical publication [45]. The eight principles define the purpose, function, and attributes of data citations and address the need for citations to be both understood by humans and processed by machines. With a slightly different perspective focused more on the mechanics of linking published articles with data repositories, DataCite and the International Association of Scientific, Technical and Medical Publishers have issued a joint statement recommending best practices for citation of technical datasets in journals [46]. Two of the key recommendations are encouraging authors of research papers to deposit researcher-validated data in trustworthy and reliable data archives and encouraging data archives to enable bi-directional linking between datasets and publications by using established and community-endorsed unique persistent identifiers such as database accession codes and digital object identifiers (DOIs).
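As a small worked example, the sketch below assembles a dataset citation following the general Creator (Year). Title. Version. Repository. Identifier pattern that DataCite recommends; the helper function, its exact punctuation, and all example values are hypothetical.

```python
def cite_dataset(creators, year, title, repository, doi, version=None):
    # Illustrative formatting only; the punctuation is not normative.
    parts = [f"{'; '.join(creators)} ({year})", title]
    if version:
        parts.append(f"Version {version}")  # temporal granularity, see below
    parts.extend([repository, f"https://doi.org/{doi}"])
    return ". ".join(parts) + "."

# Hypothetical creators, dataset, repository, and DOI:
print(cite_dataset(["Doe, J.", "Roe, R."], 2014,
                   "Synchrotron diffraction dataset",
                   "Example Materials Data Repository",
                   "10.xxxx/example", version="1.0"))
```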

An outstanding technical issue in data citation yet to be resolved concerns the granularity of the datasets used in a publication, both spatially and temporally. Spatial granularity refers to citing only the subset of the dataset used in the research. Temporal granularity can refer either to the version of the dataset used or to the temporal state of the dataset at the time of use if the dataset itself is dynamic.

Data quality

A key concern in linking datasets to publications is the provision of quality metrics; that is, can the data's ultimate reliability be assessed in a meaningful manner? Materials data can be provided as two basic types: experimental and computational; both types assume underlying models. In order for data and these associated models to be usable, their quality must be ascertained. In this context, it is useful to define the following for data and models:

Pedigree - where did the information come from?

Provenance - how was the information generated (protocols and equipment)? This metadata should be sufficient to reproduce the provided data.

In addition to these qualitative descriptors of the data, there are a number of meaningful quantitative measures of the data's quality. In general, the following metrics provide a strong basis for such an assessment:

Verification - (applies to computational data only) how accurately does the computation solve the underlying equations of the model for the quantities of interest?

Validation - how well do computational (or, rarely, analytic) realizations of a model agree with experimental results?

Uncertainty - what is the quantitative level of confidence in our predictions?

Sensitivity - how sensitive are the results to changes in inputs or assumed boundary conditions?

Similar, and perhaps more difficult, problems pertain to simulation data. While such data may be perfectly precise in a numerical sense, simulations typically rely on many parameters, assumptions, and/or approximations. In principle, if the above are specified, and the quantitative metrics meet user requirements, the data can be used with a high level of confidence. A similar approach to defining data quality was recently proposed within the context of the Nanotechnology Knowledge Infrastructure Signature Initiative within the National Nanotechnology Initiative [47].
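As a minimal sketch of how the quantitative metrics above might be exercised, consider the toy case below: a Hall-Petch model for yield strength versus grain size, compared against invented 'experimental' values. Verification is omitted because the toy model is analytic; everything here is illustrative, not a recommended protocol.

```python
import statistics

# Toy Hall-Petch model: sigma_y = sigma0 + k / sqrt(d). Parameter values
# and "experimental" data are invented for illustration only.
def model(d_um, sigma0=300.0, k=250.0):
    return sigma0 + k / d_um ** 0.5

experiment = {10.0: 382.0, 25.0: 351.0, 50.0: 334.0}  # grain size (um) -> MPa

# Validation: agreement between computational results and experiment.
residuals = [model(d) - y for d, y in experiment.items()]
print("mean abs validation error (MPa):",
      sum(abs(r) for r in residuals) / len(residuals))

# Uncertainty: spread of the residuals as a crude confidence measure.
print("residual std dev (MPa):", statistics.pstdev(residuals))

# Sensitivity: finite-difference response of the prediction to an input.
d, h = 25.0, 1e-3
dsigma_dd = (model(d + h) - model(d - h)) / (2 * h)
print("d(sigma_y)/d(d) at 25 um (MPa/um):", dsigma_dd)  # about -1.0
```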

An often-posed question in the research community with regard to data associated with peer-reviewed journal articles is that of peer review of the data itself. Indeed, it has been reported that approximately 50% of datasets reviewed for submission to the American Mineralogist Crystal Structure Database contained errors [48]. The elements defined above represent the key criteria by which to judge the quality of the data. General pedigree and provenance information is typically conveyed in most research articles, though often in insufficient detail to reproduce the data. The remaining elements of validation, verification, uncertainty, and sensitivity are relatively loosely defined within materials science and engineering, and best practices either have not generally been developed for each element or, where developed, are not in widespread use. For further discussion on these topics, the reader is directed to [49] and [50].

Intellectual property and liability

There is quite a bit of complexity and even ambiguity with regard to the legal protections governing scientific data [51]. In general, scientific data are treated as facts and therefore not copyrightable under US law. However, the aggregation of the data into a single compilation or database may be copyrightable in the US. Additionally, and importantly, the codes, formats, metadata, data structures, or any 'added value' to the data could also be subject to copyright. Laws in other parts of the globe, particularly the European Union, add complexity to the situation. The EU's database directive, for example, protects databases against wholesale use by other parties without permission.

There may be instances where the authors of a document do not want their data released immediately on publication of the supported manuscript. They may have very good, justifiable grounds to protect their data for some period following publication. One likely reason may be the additional time required to file an invention disclosure related to the data. Another case may be that the authors are in the midst of writing another manuscript dependent on the same data. To account for these special cases, the associated publication should have allowance to grant the author an embargo period to protect the data for a short time after document publication. Typically, when an embargo is granted, the author must still post the supporting data to a repository prior to manuscript publication, but the data are not released to the public until the embargo period has expired. This is a standard practice in other technical disciplines, with limits of 12 months being typical and at the discretion of the editor.

Proprietary and export control restrictions may also affect the release of the metadata associated with the dataset and could warrant embargo or even permanent withholding of the entire metadata description. Take a researcher who has been provided a quantity of material by an industrial partner. The researcher may be free to report on a newly observed deformation phenomenon in the material with respect to its microstructure but may be restricted by the partner in providing proprietary details about how the material was processed. In this case, the metadata may not contain the full pedigree and provenance needed to reproduce the experimental results. Export control provides an analogous situation; the data may not be restricted, but the metadata needed to provide full pedigree and provenance may reveal export-controlled information [52]. Allowances for the withholding of metadata from publication must be in place, and these decisions to either accept the embargo or reject the dataset should be left to the reviewer and editor. It should be noted in publication policy that authors take full responsibility for review and release of proprietary and export-controlled information.

Given the discussion above regarding intellectual protection of data, policy regarding the requirements for licensure of data for reuse should be made clear. Of course, one must also consider where the data repository resides, so any policy may have somewhat limited scope. Creative Commons has developed a series of free copyright licenses for public use when sharing creative works [53]. For example, one option may be to require all deposited data to be covered by a CC-BY license, as defined by Creative Commons. A CC-BY license grants free use of the data by all parties, including for commercial use, but requires attribution. However, other data repositories, such as the Dryad digital repository, have chosen to implement a Creative Commons Zero (CC0) license in order to remove any barriers to data reuse [13]. CC0 dedicates the work to the public domain and does not legally require attribution to the data source; instead, Dryad relies on community norms for proper attribution of data. This option is particularly suitable in a case where a researcher draws on hundreds or thousands of datasets in their work, making citation of all sources impractical. And, as noted at the beginning of this subsection, even the question of the applicability of a copyright license to technical data is still open. Unanswered questions also linger regarding liability issues associated with making data accessible. Again, consideration must be given to where the data reside (who is making them available) in determining liability.

Archiving policy

A potential path forward is to establish a working group(s) comprised of members from the MSE community to craft a common data archiving policy. Such a policy should address the following:

1. A general definition of the data to be archived that is flexible enough to meet specific publication needs.

2. Criteria for suitable repositories.

3. Expectations or requirements to follow data or metadata standards.

4. Definition of standards for data citation and attribution.

5. Requirements and/or measures for data quality.

6. Clarity on intellectual property and liability issues.

7. Areas of opportunity for targeting pilot data archiving efforts (e.g., thermodynamic data).

Repositories

A complementary working group(s) from the MSE community should also be commissioned to develop a plan to provide supporting repositories for the MSE community. Some anticipated tasks and options include

1. Catalog and explore the suitability of and potential for existing materials repositories to host datasets associated with peer-reviewed journals (e.g., the NIST CALPHAD database).

2. Explore the use of other established journal data repositories for their suitability for MSE data (e.g., http://www.datadryad.org).

3. Engage funding agencies for help in establishing specialized MSE data repositories.

4. Develop a time-phased strategy to provide well-structured materials repository architectures.

5. Consider business models that would sustain these repository services over the long term.

Conclusions

The era of open science is upon us, and the MSE community must generate a response that best suits the needs not only of the individual researcher but also of the larger community, including academia, industry, and government. It is becoming clearer with the advance of materials research that supporting data can no longer remain invisible behind a technical publication. This paper has outlined the key issues that will need to be considered as the community develops an approach to data archiving in support of publications. Charting the right course will take time and much effort, as the problem is quite complex. Fortunately, other technical disciplines have begun down this path before us, and we can learn from and capitalize on their experience. Some suggested community-based actions have been outlined that would help pave the way in setting a common approach to the archiving of materials data.

Endnotes

aThis is based on a query to the Astrophysics Data System, http://adsabs.harvard.edu/, for peer-reviewed papers mentioning either 'SDSS' or 'sloan' in the title or abstract of the paper. A query executed on 9 April 2014 resulted in 5,825 papers.

bCHW and JAW discussion with AAP, STM, AIP, ACS, Elsevier (2012).

cCHW discussion with A. Acharya, Google, Inc. (2012).

d'Certain commercial equipment, instruments, or materials (or suppliers, or software, …) are identified in this paper to foster understanding. Such identification does not imply recommendation or endorsement by the US Government, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose'.

eCHW discussion with M. Whitlock, U. British Columbia (2012).