Introduction

The 2021 Materials Genome Initiative (MGI) strategic plan highlighted the need for unification across MGI as one of the most important problems to be solved [1]. In defining our approach toward unification, we were drawn back to the use of the term “genome” in the original MGI whitepaper [2], which opened with the following quote: “A genome is a set of information encoded in the language of DNA that serves as a blueprint for an organism’s growth and development.” From our exploration of the nature and design of patterns and pattern languages, we came to understand that they are inherently “genomic”: they represent and systematically unfold knowledge of any structure or process that has been expressed in a unified grammatical form, and they form a basis for the evolutionary growth and development of a community knowledge base over time. The approach of this paper rests on the recognition that many of the problems, processes, and structures required and used within MGI research and development (R&D) projects are instances of recurring problems whose solutions can be reused in a general and combinatory manner. The approach remains relevant across scales, domains, technologies, developmental phases, and the development paradigms preferred by different projects or efforts.

Using these ideas as a starting point, this paper outlines a promising approach to the unification problem: using schemas and persistent identifiers to represent patterns and pattern languages in a common manner and through a common infrastructure. The paper introduces the context of the unification problem, provides background on what pattern languages are and what the Configurable Data Curation System (CDCS) infrastructure is, and describes how their combination could provide a powerful approach toward addressing unification problems in MGI. The paper’s exploration of CDCS R&D projects not only shows historical growth but also provides representative examples of the types of activities that can benefit from this kind of unification over time.

The Problem of Unification in MGI: The Need for an Ecosystem-Level MGI Process Mapping

As we reflect upon a decade of Configurable Data Curation System (CDCS) infrastructure development in the Materials Genome Initiative (MGI), we return to the original whitepaper [2], the recent 2021 MGI strategic plan [1], as well as our own experiences. We note that over the course of MGI so far, certain things have remained invariant. These include the great deal of heterogeneity and complexity in materials science, its subdomains, and their activities, as well as in MGI as a whole. In considering the sources of this complexity, we realize that it may be addressed by identifying common patterns and themes that repeat across many scales and activities.

Figure 1 illustrates a key problem in MGI: the complexity that drives the need for unification. Even a simple enumeration of the types of complexity and variation evident in MGI reveals many sources of complexity, and their number is growing. Indeed, by viewing MGI itself as a complex, evolving system, we began to identify important strategies for addressing its unification problems. MGI’s systems must be organized, coordinated, and maintained if they are to support accelerated, innovative materials science at community scale.

Fig. 1 MGI problems in development process complexity and pattern-language solutions on CDCS

Two Solution Components: Pattern Languages and CDCS

Pattern Languages

We have identified an integrated approach that can address several of these problems, based on insights from pioneers in complex modeling and community-scale problem-solving across science, mathematics, and engineering. The approach leverages community development and a unified paradigm for modeling and problem-solving at scale. It uses pattern languages [3,4,5] to define, integrate, and coordinate lifecycle process models in every domain, project, and activity. In this way, a literal MGI process map and its generalizations could be derived from actual practice and maintained by those who do the work. Providing up-to-date, reusable representations of community activity through an existing piece of MGI infrastructure can complement and advance ongoing work in data ingest, research, and development. It can also provide a basis for holistic reuse and guidance, since development processes occur in every part of MGI as well as in MGI as a whole.

Figure 2 provides an example pattern language for sharing patterns of text processing that recur within and across various R&D projects. The example shows how specification and sharing of component, pipeline, and application patterns for similar text processing problems through a community-based pattern language registry can support community-based solution of common text processing problems. This figure presents a pattern language example which is inspired by a more systematic discussion of language implementation patterns [6]. The pattern and pattern-language concepts which are illustrated in this example apply more generally to many more kinds of patterns and processes which occur at many different levels across MGI or various domains. These can range from domain-specific patterns to architectural patterns, patterns of design, implementation, analysis, discovery, and more. Relative to this particular example, several of the patterns referenced on the left side of Fig. 2 are already instantiated in the CDCS as part of its core functionality for parsing, processing, and visualizing particular kinds of data formats. The applications on the right show examples of future work not only on our part, but also in the community itself as various individuals and projects work to define, share, instantiate, combine, and extend community-based patterns like these within a publicly evolved pattern language framework. The figure is given merely as one illustration of the approach. The scope of pattern language applications to MGI problems extends much further than this example.

Fig. 2 Example text processing pattern language
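As a minimal sketch of how component patterns might be registered and then combined into a pipeline pattern, consider the following Python fragment. The names Pattern and PatternRegistry and the three example patterns are hypothetical illustrations; they are not part of CDCS or of the pattern language shown in Fig. 2, and an in-memory dictionary stands in for a community registry.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Pattern:
    """A named, reusable solution to a recurring problem (hypothetical model)."""
    name: str
    problem: str
    apply: Callable[[Any], Any]


class PatternRegistry:
    """In-memory stand-in for a community pattern registry (an assumption)."""

    def __init__(self) -> None:
        self._patterns: Dict[str, Pattern] = {}

    def register(self, pattern: Pattern) -> None:
        # Refuse silent overwrites so that a shared name keeps one meaning.
        if pattern.name in self._patterns:
            raise ValueError(f"pattern already registered: {pattern.name}")
        self._patterns[pattern.name] = pattern

    def get(self, name: str) -> Pattern:
        return self._patterns[name]

    def compose(self, name: str, problem: str, parts: List[str]) -> Pattern:
        """Build a pipeline pattern that applies the named parts in order."""
        steps = [self.get(part) for part in parts]

        def run(value: Any) -> Any:
            for step in steps:
                value = step.apply(value)
            return value

        pipeline = Pattern(name, problem, run)
        self.register(pipeline)
        return pipeline


registry = PatternRegistry()
# Component patterns: each solves one small, recurring text-processing problem.
registry.register(Pattern("tokenize", "split text into words", str.split))
registry.register(Pattern("normalize", "make tokens case-insensitive",
                          lambda tokens: [t.lower() for t in tokens]))
registry.register(Pattern("count", "summarize term frequency",
                          lambda tokens: dict(Counter(tokens))))

# Pipeline pattern: a reusable combination of the components above.
word_count = registry.compose(
    "word-count", "count terms in free text", ["tokenize", "normalize", "count"]
)
print(word_count.apply("Steel steel alloy"))
```

Because the composed pipeline is itself registered as a pattern, it can be reused as a building block in larger application patterns, which mirrors the component, pipeline, and application layering described above.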

The strategy of community-scale problem-solving through pattern languages was originally used to derive languages for design, implementation, and use that encode and generalize existing problem-solving practice, making it available for reuse and refinement by a community of individuals performing similar tasks. To be robust, process definitions require definitions of their data and data types; this integrates existing data and semantics while also providing a foundation for expansion as needed. The capability to visualize, access, and reuse whole processes based on patterns offers benefits comparable to those of modern interactive travel maps for navigation, which are also maintained and updated based on community activity. Being able to see a whole map of possible and actual activities occurring in any part of MGI could enable future applications such as identification of gaps, evolving trends, opportunities, and more.

Pattern languages integrate well with robust modeling practices from complex modeling theory and model-oriented software development. Inspiration for this approach came from integrating the ideas of Peirce, Grenander, Alexander, Hintjens, Gershenfeld, and Chachra, who provided deep insights into problem-solving patterns across science, mathematics, engineering, and technology at scale (Table 1).

Table 1 Insights from scalable problem-solving pioneers

A pattern-language approach to coordinated MGI process-model development can be summarized as follows:

  • Follow community-scale knowledge development processes for research and development (Peirce, Alexander, Hintjens)

  • Model ecosystem processes as pattern-languages describing the patterns of problems solved at each step. Show how these—similar to alternative sources, routes, and destinations in the larger context (map)—are derived from how research and development processes are actually executed (culturally grounded). Generalize them using a unified set of general patterns so they can be reused across different activities (Alexander, Grenander)

  • Formulate and integrate complex, technical scientific, technological, process, and data models via a common modeling representation that supports domain-specific problem-solving in a domain-independent way that can link problem-solving, patterns, and implementations (Grenander, Alexander, Hintjens)

  • Instantiate complex problem-solving process patterns in a unified manner as scalable communities and systems developed and used by large-scale communities (Hintjens, Alexander, Grenander)

  • Grow communities that can innovatively create new materials and technologies by following culturally grounded patterns of building (Alexander, Hintjens, Gershenfeld)

  • Create, use, and sustain large-scale infrastructure, proactively repairing and evolving it over time, driven by innovative new materials and technologies (Chachra, Gershenfeld, Hintjens, Alexander)

To understand how such an approach could be realized on existing MGI infrastructure, we now introduce the Configurable Data Curation System (CDCS) platform, its evolution, and its application toward solving this problem. In particular, we consider the capabilities of the CDCS to create and link schemas together and how those capabilities may be leveraged to support the exchange of pattern definitions and languages in a unified format.

Configurable Data Curation System (CDCS)

CDCS Essential Capabilities

Within the MGI program at the National Institute of Standards and Technology (NIST), the Configurable Data Curation System (CDCS) infrastructure was created as an informatics platform to enable widespread data modeling, management, exchange, processing, and discovery. Built from composable web technologies, CDCS instances have been used in a number of projects to support scientific research and development (R&D) processes that need to represent, validate, find, exchange, model, manage, transform, and visualize data in a number of ways. In addition, they have been used as components in many kinds of R&D project workflows and infrastructures [32,33,34,35,36,37].

Figure 3 is intended to illustrate the following aspects of essential CDCS functionality:

  1. A given CDCS instance is developed by a development team and customized to support the R&D needs of a given community. Its code, modular packages, and deployment configurations are finalized.

  2. A CDCS deployment team deploys the instance using the prepared deployment configurations.

  3. A community of researchers and developers works together to develop and share data, metadata, models, and applications that are relevant to their R&D efforts.

  4. An identity provider may be used to authenticate users on the system.

  5. Users may assign unique persistent identifiers (PIDs) to files and metadata in the CDCS using a service to provide those PIDs.

  6. Users may use the CDCS core data, metadata, and services in an automated fashion via the CDCS REST API, its associated CDCS REST library (pycdcs), and any scripts, notebooks, or other REST-based tools users may wish to develop, share, or employ.

  7. Users may also leverage CDCS functionality manually via the interactive, web-based CDCS user interface, since all core functionality is available through both interfaces.
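The automated access described in item 6 can be illustrated with a minimal sketch that prepares, but does not send, a REST request using only the Python standard library. The host name, the `/rest/data/` path, the `title` query parameter, and the token-style Authorization header are assumptions made for illustration; the actual endpoints and authentication scheme should be taken from the REST API documentation of the CDCS instance in use, or the pycdcs library can be used instead.

```python
from typing import Dict, Optional
from urllib.parse import urlencode, urljoin
from urllib.request import Request


def build_query_request(
    base_url: str,
    path: str,
    params: Optional[Dict[str, str]] = None,
    token: Optional[str] = None,
) -> Request:
    """Prepare a GET request against a REST endpoint without sending it."""
    # Normalize slashes so "host" + "/path" and "host/" + "path" agree.
    url = urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))
    if params:
        url = f"{url}?{urlencode(params)}"
    headers = {"Accept": "application/json"}
    if token:
        # Assumed header format; check the target instance's documentation.
        headers["Authorization"] = f"Token {token}"
    return Request(url, headers=headers, method="GET")


# Hypothetical instance URL and endpoint, used only for illustration.
request = build_query_request(
    "https://cdcs.example.org", "/rest/data/", params={"title": "demo"}
)
print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen` would perform the call; keeping construction separate from sending makes the request easy to inspect in a notebook before any data is exchanged.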

Fig. 3 Essential CDCS functionality

While it is possible that some readers may be inclined to try to equate CDCS to particular domain-specific approaches—such as integrated computational materials engineering (ICME)—it is important to view CDCS as being both distinct from but also complementary to such strategies. The architecture of CDCS is very general, and its uses are widely varied. It is increasingly used as infrastructural glue in many different R&D configurations. In essence, it can be understood as a medium for problem-solving rather than being characterized in terms of any particular approach. Figure 4 illustrates CDCS’s general scope by showing the ability for CDCS to be involved in every lifecycle phase of a research and development (R&D) effort, from its initial formation, through its various activities involving data collection, modeling, ingest, validation, management, or discovery.

Fig. 4 CDCS infrastructure-based community problem-solving

CDCS Infrastructure R&D Projects

Over a decade ago, a number of communities and projects initiated MGI activities, looking for ways to step through MGI-related developmental phases [2, 32]. This involved identifying, in manual or automated fashion, what resources and tools were available or needed, what kinds of activities needed to be performed to realize their R&D objectives, what aspects of their workflows needed automation and integration with resources and tools, and more. These efforts involved not only close collaboration and integration between research teams and the CDCS team but also growth and development within teams, projects, and organizations. In turn, projects and activities (for example, in additive manufacturing) support communities of potentially world-wide scope that are developing new materials and require scientific resources or guidance. The CDCS infrastructure supporting these projects is available to external world-wide communities, who can access it manually or automatically according to their needs, as well as to internal research and development communities, who supply and develop resources, tools, and knowledge for the advancement of materials science. This is especially true for the growing number of CDCS-supported projects that address lifecycle engineering concerns in materials science and beyond. These include individual and interlinked repositories and registries for inter-atomic potentials research, phase-data-based materials science research, force-field organic and soft materials research, data-driven materials discovery and design research, materials science research resources, greenhouse gas research resources, circular economy research resources, additive manufacturing research resources, educational resources for science, technology, engineering, and mathematics (STEM), and more.
They also include networks of knowledge and capability for international metrology research resources, COVID-19 informatics literature research, and research infrastructure for large-scale scientific image analysis [34]. CDCS registry infrastructure projects, in particular, have provided community-scale examples of knowledge representation (in the form of community-centric schemas, controlled vocabularies, and semantic assets), of CDCS system-to-system exchange, and of community organization around domain-specific efforts to acquire, organize, and apply domain-specific knowledge toward the achievement of MGI-based objectives [35].

The CDCS infrastructure has enabled a number of efforts, both inside and outside of NIST, to rapidly prototype and pilot various ideas, projects, or systems. This ability to rapidly explore, even fail, and recover is an important trait in the rapid learning and evolution of systems, as noted by Alexander and Hintjens [15, 26]. Enabling this exploration at scale encourages innovation without fear of catastrophic failure and supports the conditions necessary for rapid exploration, which are essential to MGI’s mission of rapid discovery. A few of the projects that CDCS infrastructure has supported through rapid prototyping and piloting include the Digital NIST efforts for digitalization in metrology [38], the Satellite Calibration project for prototyping infrastructure to support low-earth satellite information, the metamaterials genome research project [39], the COVID-19 literature repository and registry to support COVID-19-related informatics [40,41,42], support for community-scale FAIR (findable, accessible, interoperable, and reusable) container development in scalable scientific computing [43], and more.

A selection of CDCS community infrastructure examples is given in Table 3 of the appendix at the end of this paper. For more historical information about CDCS collaborations, see the NIST CDCS project page [34].

CDCS Scalability

CDCS [33] has grown and evolved by embodying many of the best practices of scalable and sustainable systems identified throughout this paper. It has remained adaptive and sensitive to its communities’ needs to ingest and represent (model) data. It has provided a basis for communities to flexibly model their data according to their needs, including all of the style, constraints, and unique preferences of a given project, community, or domain. CDCS has been based on foundational, scalable technologies and practices such as the following:

  • Choices for knowledge-representation of data allow users to create custom modeling-languages via Extensible Markup Language (XML) and JavaScript Object Notation (JSON) and to translate and convert those through extensions to any number of desired formats.

  • Web services are provided based on the Internet-scale representational state transfer (REST) software architectural style [4, 19].

  • System architecture and functionality are organized into modular systems of configurable packages enabling users to instantiate a number of patterns of functions and configurations to fit their workflow needs.

  • CDCS supports a wide variety of project workflows, data management, and access control via workspaces, single sign-on, and multi-factor authentication.

  • Systems for storage [36] and search [37] can be instantiated and composed at global scale, organized into federations, and integrated to enable exchange of resources between systems.

  • CDCS has continued to scale in its ability to become inter-connected with other processes, systems, or clusters, to enable resources and models to be leveraged through whatever tools are needed. This is possible through a set of possible middleware alternatives it supports (currently, Redis, Hintjens’ ZeroMQ [18], and possibly others in future).

  • CDCS has scaled its ability to provide bulk data loading capabilities in particular as well as automatic data migration capabilities for each release.

  • CDCS has increased its ability to support a wide variety of persistence layer options (Fig. 5), including an expanded set of database engines and file-system storage options in local, remote, and distributed data and file storage configurations.

  • CDCS has increased its ability to offer default modular deployment capabilities on major platforms (through Docker [44]) as well as scalable deployments (through Kubernetes [45]).

  • CDCS has also been committed to keeping pace with breaking trends and researcher needs to access and leverage advanced indexing, advanced text processing, and emerging tools, such as large language models (LLMs) [46], demonstrating the architecture’s flexibility.
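The first item in the list above, translating between XML and JSON representations, can be sketched with the Python standard library. This is an illustrative conversion only, not the translation mechanism that CDCS extensions implement; the element names in the sample record are hypothetical, and XML attributes and mixed content are deliberately ignored to keep the sketch short.

```python
import json
import xml.etree.ElementTree as ET
from typing import Any, Dict, Union


def element_to_value(element: ET.Element) -> Union[str, Dict[str, Any]]:
    """Convert an element to a string (leaf) or a dict of its children."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result: Dict[str, Any] = {}
    for child in children:
        value = element_to_value(child)
        if child.tag not in result:
            result[child.tag] = value
        else:
            # Repeated tags are collected into a list, in document order.
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
    return result


def xml_to_json(xml_text: str) -> str:
    """Serialize an XML document as JSON, keyed by its root tag."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_value(root)}, sort_keys=True)


# Hypothetical record; the tag names are not taken from any CDCS schema.
record = (
    "<material><name>Ti-6Al-4V</name>"
    "<phase>alpha</phase><phase>beta</phase></material>"
)
print(xml_to_json(record))
```

A production converter would normally be driven by the schema itself, so that element cardinality and data types come from the model rather than from whatever happens to appear in a single record.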

Fig. 5 CDCS persistence layer advances

CDCS represents a reference implementation of these scalable concepts and processes in its current web-application framework instantiation (Django). In the long term, however, CDCS itself could be reimplemented on alternative stacks with a common model (specification), allowing for long-term evolution of the kind demonstrated by Hintjens and explored by Chachra. CDCS is defined by its co-evolution with the communities it serves. It grows and scales continually to support the problem-solving needs and directions of its communities. In Hintjens’ model of community development, a given development cycle is a problem-solving process; the same is true of CDCS releases. These have occurred incrementally, every 6–8 weeks, for many years, occasionally resulting in major releases. When CDCS users face a known problem that has already been solved by CDCS infrastructure, reuse happens immediately, since they can leverage existing resources and capabilities. When a new problem arises that has not been faced before, development (innovative problem-solving through design and implementation) provides a new understanding of the problem as well as new capability for solving it, which are extended to our growing number of communities through shared infrastructure.

CDCS Community-Scale Problem-Solving

CDCS and its participating communities problem-solve together. This happens not merely through CDCS’s own efforts but also through individual and project-specific problem-solving. Researchers and projects often need to customize, extend, or apply CDCS configurations in new and different ways in order to achieve their goals. For example, CDCS has provided a REST application programming interface (API) from the beginning. However, it was the creation and support of a particular Python client library, pycdcs, that catalyzed teams and projects to use CDCS automation through their existing Jupyter notebooks and scripts, integrated with data management practices and data science libraries [47]. Indeed, since REST is a language-neutral interface [19], anyone in the community could create client libraries in other programming languages, extending the same kind of catalyzing usage to users of those languages. Hintjens’ ZeroMQ community achieved cross-language portability of their core functionality through community-contributed client libraries in exactly this manner. The fact that CDCS infrastructural communities are already following this development pattern is a good sign of community-driven scaling. Similarly, various projects have innovated within the patterns provided by CDCS.

A number of projects have explored various kinds of modular data models appropriate to their needs. Others have explored the use of CDCS persistent identifiers in order to create infrastructure-scale data-structures that can be referenced or composed, potentially, from anywhere on the Internet. This was the main idea behind persistent identification techniques, an idea that has been brewing in larger trends, such as in the Semantic Web for the entire World Wide Web, for some time [48]. The CDCS and MGI communities can gradually continue to knit together local and distributed data structures and can continue to grow a base of tools that can access and automate processing with them over time. The capability for supporting persistent identifiers in the CDCS is based on local and global infrastructure services. Locally, a CDCS can form and resolve internal unique identifiers. Globally, CDCS systems have been configured and deployed to connect to handle.net services for public unique identifiers and services. With PID services enabled on a given CDCS instance, users can uniquely identify and reference records, data objects, and data structures using unique persistent identifiers. The nature and scope of these identifiers and their usage are based upon the design of each application. And these designs and their access controls may increase or decrease the ability of some data to be accessed, connected, or used by other systems even if their data have assigned and interlinked PIDs. The CDCS has implemented support for persistent identifiers and their supporting services for some time. In the appendix, one can observe a number of existing CDCS projects that already use the persistent identification functionality. Nearly every new project uses it as a matter of course. 
While it will take some time for the full power of this capability to be realized, the infrastructure is in place, and a growing number of use cases for persistent identification continue to emerge through community usage. So far, they include unique identification and tracking of experimental results from electron microscope laboratory information management systems (LIMS), unique scientific sample tracking, COVID-19 literature informatics, and more. All of these uniquely identified elements and structures may currently be manipulated and analyzed through automated REST-based analysis programs accessing distributed CDCS instances.
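The essential behavior of a persistent identifier, a stable name whose target location may change, can be sketched as follows. PidRegistry, its methods, the `demo-prefix` namespace, and the example URLs are all hypothetical; the sketch models neither the CDCS PID implementation nor the handle.net service interface, and it keeps its mapping in memory where a real service would use durable storage.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, Iterator


@dataclass
class PidRegistry:
    """Toy resolver that maps stable identifiers to current locations."""
    prefix: str
    _targets: Dict[str, str] = field(default_factory=dict, init=False, repr=False)
    _counter: Iterator[int] = field(
        default_factory=lambda: count(1), init=False, repr=False
    )

    def mint(self, target_url: str) -> str:
        """Create a new identifier of the form prefix/suffix for a target."""
        pid = f"{self.prefix}/{next(self._counter):06d}"
        self._targets[pid] = target_url
        return pid

    def move(self, pid: str, new_target_url: str) -> None:
        """Point an existing identifier at a new location; the PID is unchanged."""
        if pid not in self._targets:
            raise LookupError(f"unknown identifier: {pid}")
        self._targets[pid] = new_target_url

    def resolve(self, pid: str) -> str:
        """Return the current location registered for an identifier."""
        try:
            return self._targets[pid]
        except KeyError:
            raise LookupError(f"unknown identifier: {pid}") from None


# Hypothetical prefix and record URLs, used only for illustration.
pids = PidRegistry("demo-prefix")
sample_pid = pids.mint("https://cdcs-a.example.org/records/42")
pids.move(sample_pid, "https://cdcs-b.example.org/records/42")
print(sample_pid, "->", pids.resolve(sample_pid))
```

References that cite the identifier rather than the URL keep working after the record moves between instances, which is the property that lets distributed data structures be composed across systems.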

One of the important future use-cases for PID-related functionality—in the Semantic Web, in CDCS communities, and wherever PID approaches are applied—has to do with applications involving large-scale data integration, merging, or data fusion. Indeed, to move toward such objectives is to advance toward more scalable, more reliable interoperability of data, models, systems, and architecture over time. Identification is only one piece of that vast puzzle. The strongest possible identification is dependent upon the development, integration, maintenance, and evolution of strong, integrated, shared, and meaningfully grounded models. Such models provide a verifiable basis for interpretation regarding all other activities. Identification is merely a means for referencing aspects of these models as they are used in applications. This entire paper is focused on facilitating strong foundational modeling across scales. Best practices and capabilities in identification will emerge in tandem with good models. The models—as advocated by the patterns and pattern language strategies in this paper—provide the semantic basis and necessary foundation for robust identification.

Moving beyond persistent identification considerations, we now focus our attention on infrastructural concerns themselves. Just as Chachra observed that infrastructure inherently tends to form an underlayer upon which other services are built, this is happening increasingly with the CDCS infrastructure. As a result, additional cross-domain guidance will be needed to guide larger-scale design and development activities. Many of the individual ideas mentioned above may become increasingly useful as MGI workers start to look for unified ways to problem-solve through shared infrastructure over time. Perhaps through integration using shared problem-solving pattern-languages, they can start to codify their design and development process patterns in a natural and scalable way. Hintjens gave examples of how they achieved this scalability goal through scalable open-source software and systems by creating and reusing small, public specifications and protocols [49], as well as shared models and methodologies for creating, testing, and composing software projects, tools, and data over time [25]. In their communities, by following these approaches, they started to see more community-level engagement on meta-problems (pattern-level problems) where individual projects could start to have whole-infrastructure impact by plugging into community-scale processes for development: for example, when someone would develop support for a whole new protocol, format, or process, and then share that in a reusable way [26]. They even encoded their community development process as a model that they all agree to follow for consistency, integration, and evolution over time [27].

The CDCS infrastructure and communities have begun to see evidence of community-scale problem-solving already. This has involved specific problems and solutions that have wider relevance than a single task or research objective and that may apply to many others. The pycdcs client library was one example. Another is the NexusLIMS project [50], which provides problem-solving patterns to interface communities with important instrumentation and to acquire, manage, and process the associated data, equipment, and community users. In addition, the Sample Database tracking project provides an example of how to integrate digital and physical tracking of samples across NIST using persistent identifiers, QR codes, and other techniques. Such capabilities can immediately be seen as kinds of recurring process patterns (problems and potential solutions) that can arise in many contexts. These can provide a significant, reusable benefit to others through formalization of insights found in these efforts. Other projects find reuse gradually through initial interactions in shared teams. In the NIST Additive Manufacturing Benchmark Test Series (AM-BENCH) project [51], for example, not only is significant modeling and development occurring in service of the external additive manufacturing R&D community, but there has also been a rich internal exchange of ideas, schemas, tips, notebooks, scripts, and more over time among related projects, such as NIST’s Additive Manufacturing Materials Database (AMMD) project [52]. Indeed, within the highly collaborative AM-BENCH project itself, one can see a microcosm of MGI, with its activities to engage and connect communities of metrologists, modelers, and developers across many scales, its interconnection of CDCS to an increasing variety of infrastructure, and its inter-project, cross-domain, and multi-institution collaborations.
Across CDCS projects, one can observe innovations ranging from creation or extension of components, to creation of specialized systems, services, custom schemas, novel uses of PIDs, and more. The innovations, in each case, are the result of development to fulfill existing needs. As noted by Hintjens and as demonstrated by CDCS itself, development is the result of solving a problem and sharing a new pattern of problem-solving capability. Many of these projects have their own project repositories for tracking their evolving code and data. As they grow and are refined over time, they reach a point in their development where they can be reused by others and integrated across the CDCS ecosystem.

CDCS Growth and Sustainability

The growth of the CDCS infrastructure usage, particularly at NIST, has resulted in a number of CDCS instances that support one or more projects and related communities. As is true for all infrastructure, this requires maintenance for repair and growth. Support for maintenance requires its own resources, considerations, designs, and costs. In a sense, we are returned to the lessons of form and function on yet another level: the organizational and ecosystem levels. This raises questions, which Chachra has explored at length [30]. These include questions such as: At or beyond a certain scale, who owns infrastructure? Who is responsible for it? Who takes care of it? Who pays for it? The MGI vision started, as many do, with an end-goal and application focus. However, as Chachra points out, infrastructure-scale problems are some of the largest-scale problems, designs, and costs many will ever face. Indeed, the whole reason that infrastructure is created in the first place is based on the idea that there is an inherent value achieved by combining and sharing resources for the good of the community. In the public sphere, for infrastructure such as electricity, roadways, and water, the public enters into a commitment to share the cost. Such infrastructural maintenance plans are organized, managed, maintained, and grown in a variety of ways, with some being better than others. The open-source world has not yet solved this problem in the large. But Hintjens’ and Alexander’s communities have provided a helpful model for moving toward sustainability from the standpoint that communities who use infrastructure are also those who help maintain it [3, 15, 26].

CDCS infrastructure maintenance and sustainability is a new problem that results from successful initial infrastructure development and deployment over the course of a decade. Similar to public infrastructure, the problem of maintaining it is being solved in various ways. Currently, some teams have localized resources in the form of on-staff expertise as well as localized funding to support their own team’s needs for development and maintenance. In other cases, there has been, to date, a centralized and dedicated team of experts and related funding to support projects that do not yet have their own dedicated resources. Regardless of the organizational design, sustained support for infrastructural systems will be needed to ensure appropriate distribution and load balancing of labor and cost over time. From this perspective, the question of maintenance can be integrated into future lifecycle development process considerations for each element of developed infrastructure. Maintenance planning can involve asking and answering system-level questions in a way that may need to cross traditional boundaries of organizations, teams, and domains. These can include questions such as: What communities and activities are being supported? What workload has been experienced so far, and what workload is anticipated over time? What problems are being solved (i.e., via what patterns) through this growth? What alternatives are available for solving those problems, now or in the foreseeable future? The answers pertaining to maintenance and sustainability of problem-solving infrastructure are as critical as the primary development processes themselves. As MGI and infrastructural development activities continue, those involved in maintenance processes should be encouraged to also weave their problem-solving processes into any shared knowledge base showing how sustainability problems are solved.
Such problem-solving patterns could then be shared and reused across teams, organizations, and institutions throughout the MGI. Even though the knowledge representation and sharing of those pattern-language problem-solving descriptions can be realized through existing CDCS infrastructure, the development and maintenance of that knowledge base would itself need to be a living, community-based activity. This is what Hintjens, Alexander, and others meant by a “living” socio-technical system: one that can adapt, learn, and function in its natural or built (infrastructural) environment. The degree to which such a system can effectively solve its problems, working together in an interoperable, unified manner, provides a measure of its ability to function, adapt, and evolve at various levels of performance. This ability is directly related to the system’s capacity to organize and share its core ideas in a unified manner, expressed in meaningful terms [15, 26].

MGI Unification Exploration: An Ecosystem Problem-Solving Process Representation on CDCS

Having explored the problem of unification above as well as its implementation through pattern language and CDCS components, it is natural to consider what an MGI ecosystem might be like with such a problem-solving process representation in place. Understanding the history of the CDCS infrastructure evolution within MGI can provide some clues regarding how it could evolve next. Indeed, a key aspect of the approach described in this paper is that it includes not only materials lifecycle R&D processes in its pattern language modeling, but also the lifecycle R&D processes of the infrastructure and its sustainable evolution, too. From this perspective, perhaps the future of the MGI could have models that can provide a sustainable foundation for evolving infrastructure in a generative fashion.

Combining the many ideas introduced above, we can see that a persistent unifying thread running through them all is captured in the concept Alexander formalized as “pattern languages.” Alexander’s intention was that a user can work with these pattern language structures at many levels of abstraction, much like the modern experience of representing and navigating travel routes interactively on maps. The map locations are analogous to patterns, and map travel routes are analogous to transformations from one patterned structure to another.
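The map analogy can be made concrete with a minimal sketch, assuming that patterns are stored as nodes of a directed graph and transformations as its edges. The pattern names and the `find_route` helper below are hypothetical and chosen only for illustration; they are not taken from any existing MGI or CDCS pattern catalog.

```python
from collections import deque

# Hypothetical patterns (map locations) and the transformations
# (map routes) leading from each one to the next. Illustrative only.
TRANSFORMATIONS = {
    "raw-data": ["curated-record"],
    "curated-record": ["registered-resource", "process-model"],
    "registered-resource": ["federated-search"],
    "process-model": ["federated-search"],
    "federated-search": [],
}


def find_route(graph, start, goal):
    """Breadth-first search for a shortest chain of transformations.

    Returns the list of patterns visited from start to goal,
    or None when no chain of transformations connects them.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


route = find_route(TRANSFORMATIONS, "raw-data", "federated-search")
print(" -> ".join(route))
```

In this sketch, asking for a route plays the role of asking a map for directions: the answer is a sequence of patterns, each reached from the previous one by a single transformation.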

Such an approach can support the representation of structures and changes in space and time. Alexander applied such representations to architecture and building processes. Grenander innovatively applied configuration diagrams like these to structures of every possible kind, demonstrating their generality. A user could use these representations to navigate to different parts of a structure or process, examining its parts or its whole in as much or as little detail as necessary. Such pattern language representations can stand independently of (and complementary to) domain-specific development practices. This allows pattern language elements to represent general, reusable abstractions of typical steps or processes, independent of the specific methodologies used to implement them. In this way, the patterns of pattern languages can be used to achieve their primary purpose: to convey the general pattern of problems being solved at each step in the contexts where they occur. These problems can be of any type. In general, they are visualized and encountered as problems involving structure (form) and change (process, function). Due to their ability to abstract a given situation, pattern languages provide a way to deal with situations common to MGI, where many differences need to be unified across different problems, domains, cultures, structures, scales, and methods. Table 2 shows in more detail how pattern language features enable a number of desirable pattern language properties.

Table 2 Pattern-language features and their associated properties

Pattern languages were first formally described in physical architecture, and their use then spread to software architecture and other domains [53]. Initial explorations of how to represent them digitally gave rise to technologies such as Wiki [22, 54] (and, thus, Wikipedia [23]). The ability of pattern languages to provide a unified, generative language representation of MGI problem-solving patterns has the potential to create the kind of integrated, generative knowledge base aspired to by the MGI founders. The reason this holds for MGI is directly related to an insight that is well known from architecture and engineering (“form follows function”) but that is also a deep foundational principle in complex systems and knowledge representations. In science, engineering, modeling, and making of all kinds, individuals have noted that not only does Nature follow this principle (i.e., for Nature, “form = function”) but the best designs do as well [55]. In his original pattern theory paper, Grenander noted that some processes have the capacity for advanced, complex information processing because of their “logical organization” (knowledge representation, form), which is well suited to the most advanced pattern processing tasks they must perform (function) [11]. Alexander gives a visual demonstration of this concept: the ability of a system, or of any problem-solving process, to efficiently and effectively process, adapt, error-correct, or developmentally innovate depends on its knowledge organization supporting the processes it must perform [15].

With its current knowledge representation and persistent identification capabilities, CDCS could support the community-based creation of a unified pattern language format. This format could, in turn, be reused to express and integrate emerging resources and processes in every domain. If pursued to a significant degree, development of such a language could lead to a gradual emergence of unified knowledge representations across various projects, processes, and domains. With such a structure in hand, one could treat it like a map, enabling a user to navigate the processes and types of each portion of the MGI at its current state of development, at varying scales of resolution of MGI-related knowledge and processes.
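As a rough illustration of what such a format might involve, the following sketch assumes that each pattern is a record carrying a persistent identifier, the usual context, problem, and solution fields, and references to the smaller-scale patterns it contains. The field names, the `pl:` identifier format, and the `PatternLanguage` class are hypothetical; they do not reflect an existing CDCS schema or API, and in a CDCS deployment the record structure would presumably be defined by a schema and resolved through its persistent identifier service rather than an in-memory dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """One pattern record. All field names are hypothetical."""
    pid: str            # persistent identifier (illustrative format)
    name: str
    context: str
    problem: str
    solution: str
    parts: tuple = ()   # PIDs of the smaller-scale patterns it contains


class PatternLanguage:
    """In-memory stand-in for a registry that resolves PIDs to patterns."""

    def __init__(self):
        self._by_pid = {}

    def register(self, pattern):
        if pattern.pid in self._by_pid:
            raise ValueError(f"PID already registered: {pattern.pid}")
        self._by_pid[pattern.pid] = pattern

    def resolve(self, pid):
        return self._by_pid[pid]

    def outline(self, pid, max_depth, depth=0):
        """Return (depth, name) pairs, descending at most max_depth levels."""
        pattern = self.resolve(pid)
        rows = [(depth, pattern.name)]
        if depth < max_depth:
            for child in pattern.parts:
                rows.extend(self.outline(child, max_depth, depth + 1))
        return rows


language = PatternLanguage()
language.register(Pattern("pl:0001", "Materials data lifecycle",
                          "an R&D project", "work is fragmented",
                          "compose smaller patterns", ("pl:0002", "pl:0003")))
language.register(Pattern("pl:0002", "Curate data",
                          "data is produced", "data is hard to reuse",
                          "store schema-conformant records", ("pl:0004",)))
language.register(Pattern("pl:0003", "Maintain infrastructure",
                          "systems are in use", "upkeep is unplanned",
                          "share maintenance among users"))
language.register(Pattern("pl:0004", "Validate against schema",
                          "records are submitted", "records are inconsistent",
                          "check each record on entry"))

coarse = language.outline("pl:0001", max_depth=1)
fine = language.outline("pl:0001", max_depth=2)
for level, name in fine:
    print("  " * level + name)
```

The `max_depth` argument corresponds to the varying scales of resolution described above: the same identifier yields a coarse overview or a more detailed one, depending on how far the user chooses to descend.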

Conclusions

In conclusion, this paper described the need for a unified representation of MGI problem-solving processes. An approach for solving this problem in the form of pattern languages was proposed, and a means for realizing them through CDCS infrastructure was described. In doing so, the paper showed that this team, as part of MGI, has been performing community-scale problem-solving in MGI infrastructure development. That infrastructure development began by adopting the starting assumptions and approaches of the MGI program, which largely focused, in the initial phase, on data-specific concerns. Over time, through continual engagement with and reflection upon MGI problems, the centrality of modeling complex systems, and its role in advancing key MGI goals through development process mapping, became clear. As a result of this new understanding, we were able to leverage existing infrastructure work and integrate pioneering insights into an approach that shows potential as a strategy that supports and complements work to date while moving toward MGI unification. As part of that strategy, key lifecycle questions related to the importance of community-based modeling and sustainability were also considered. These ideas provide a lens for considering the evolution of MGI itself. If successful, such solutions could catalyze movement toward MGI ecosystem-level unification, evolution, and sustainability in the face of growing complexity. The next step in our future work is to implement a pattern language schema on existing CDCS infrastructure for the definition, reuse, and evolution of commonly occurring problem-solving patterns. This will enable the formalization and extension of the example patterns discussed herein as well as many others encountered by MGI R&D projects and communities.