1 Introduction

To make genomics data Findable, Accessible, Interoperable, and Reusable (FAIR) [1], it is necessary to have standards for describing the provenance of sequence data. Accurately recording information about factors like sequencing method and environmental conditions, referred to as metadata, allows for reanalysis, integrative meta-analyses, and accurate interpretation of results. The Genomic Standards Consortium (GSC) [2] is an open-membership working body formed nearly twenty years ago with the aim of supporting community-driven standards for sequence data. The primary standard produced by the GSC is the Minimum Information about any (x) Sequence (MIxS) [3], which allows researchers to capture extensive metadata on a per-sample basis. MIxS consists of a number of metadata elements (also called terms) that describe a particular characteristic of the sample or its source environment. These elements are attributes of different checklists (for describing sampling method and sequencing) and different extensions (for describing the source environment). The allowed values for each of these elements include free text, quantitative measurements, or value sets (picklists derived from different controlled vocabularies or ontologies, such as the Environment Ontology [4] for natural environments or the Uberon anatomy ontology [5] for metazoan host-derived samples). Most samples and their corresponding sequence data are described with a combination of a checklist and an extension.

Without metadata describing the environmental conditions, sample collection methods, or data generation approaches, (meta)genomic data would be meaningless [6]. As the volume and complexity of (meta)genomics data have dramatically increased and (meta)genomics has become a data-driven field [7], metadata provides the necessary contextual information for data use, reuse, and comparative analyses. Through the implementation of the MIxS standard across primary repositories, researchers are able to search and discover data of interest and perform comparative analyses such as correlating genes or functions of interest with environmental parameters. Further, as (meta)genomic catalogs across diverse samples and environments are generated, such as from human or other host-associated systems [8,9,10], soil [11, 12], and diverse aquatic habitats [13, 14], integration and synthesis will only be achieved through standardized metadata.

Here, we describe the components of the MIxS standard, discuss how to navigate specific ontologies that form the basis of mandatory terms, and provide an example of a soil metagenome collected as part of the National Science Foundation's National Ecological Observatory Network (NEON) continental-scale observation facility (https://data.neonscience.org/data-products/DP1.10107.001). We also provide updates on some recent developments in the evolution of MIxS that make the standard more FAIR and easier to use. While there are a variety of MIxS implementations for sample submission across the primary data repositories that form the International Nucleotide Sequence Database Collaboration (INSDC [15]) and other (meta)data platforms and knowledge bases [16, 17], the outlined methods aim to provide researchers with a practical approach to organizing their data from field sampling expeditions and an understanding of the terminology used in MIxS implementation. As MIxS is a community-driven standard, there are regular updates to terms on an approximately annual basis, and researchers are invited to contribute.

2 Overview of the Structure and Terminology of MIxS

The MIxS standard captures environmental information, sample collection methods, sample properties, nucleotide extraction method, quality, quantity, library preparation, and sequencing information, among other aspects. MIxS provides a number of terms (also called metadata elements) for describing these aspects of a sample. Some terms are generic and are applicable across all samples, while others are more specific to certain kinds of studies, environments, or sample collection methods. Examples of terms that are broadly applicable are depth, local environmental context, and geographic location (latitude and longitude). To ensure the MIxS standard is compliant with the FAIR guiding principles [1] and best practice for identifiers [18], all terms are assigned a resolvable, globally unique persistent identifier called a MIxS ID. For example, the term “depth” has the identifier MIXS:0000018, which is a compact uniform resource identifier (CURIE [18]). CURIEs expand to resolvable URLs by replacing the prefix and its colon (e.g., MIXS:) with a web location (e.g., https://w3id.org/mixs/), so that MIXS:0000018 expands to https://w3id.org/mixs/0000018. In addition to a definition, each MIxS term has both a long descriptive title and a short, computer-friendly name called the “structured comment name.” An example is “altitude” (MIXS:0000094), which has the structured comment name “alt” and the title “altitude.”
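The prefix substitution described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of any MIxS tooling: the function name `expand_curie` and the `PREFIX_MAP` dictionary are our own, and only the MIXS prefix and its web location are taken from the text.

```python
# Illustrative prefix map; only the MIXS entry comes from the text above.
PREFIX_MAP = {"MIXS": "https://w3id.org/mixs/"}


def expand_curie(curie: str, prefix_map: dict = PREFIX_MAP) -> str:
    """Expand a CURIE such as 'MIXS:0000018' into a full URL."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefix_map:
        raise KeyError(f"unknown prefix: {prefix!r}")
    # Replace the prefix (and its colon) with the registered web location.
    return prefix_map[prefix] + local_id


print(expand_curie("MIXS:0000018"))  # depth
print(expand_curie("MIXS:0000094"))  # altitude
```

Keeping the prefix map as data means that further prefixes (for example, those of the ontologies used by MIxS) could be added without changing the function.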

There are two main components of the MIxS standard: checklists and extensions (previously referred to as “environmental packages”). These components are described below and outlined in Tables 1 and 2. Checklists and extensions are intended to be used in a combinatorial and modular manner. New checklists and extensions may be proposed and developed in coordination with the GSC community, stakeholders within the field, and the GSC Technical Working Group.

Table 1 The MIxS checklists. Checklists include metadata terms to minimally describe the sampling and sequencing methods. The six main checklists span genomes, marker genes, metagenomes, single amplified genomes, metagenome-assembled genomes, and uncultivated virus genomes. For marker genes and genomes, the checklists are classified under sub-checklists related to the organism or sequence as specified below. All checklists share the ten metadata terms listed in the right box. Additional type-specific descriptors not listed here are defined for each checklist and sub-checklist (https://genomicsstandardsconsortium.github.io/mixs/#checklists)
Table 2 The MIxS extensions. Extensions are a collection of context-specific terms developed by community experts to provide context about the sample and environment. The GSC describes an environment as any location in which a sample or organism is found. The extensions available currently are listed below with example terms specific to each extension. Additional information and a full list of terms is available through the GSC’s GitHub (https://genomicsstandardsconsortium.github.io/mixs/#extensions)

3 Checklists Describe Sampling and Sequencing Methods

A checklist is a collection of terms that minimally describe the sampling and sequencing method of a biological sample used to generate sequence data (https://genomicsstandardsconsortium.github.io/mixs/#checklists). Checklists include mandatory, recommended, and optional terms for specific types of sequencing data: genome, metagenome, marker gene, or more recently single-cell genomes, metagenome-assembled genomes, and predicted viral genomes [19, 20]. For genomic sequences (Minimum Information about any Genome Sequence, MIGS [21]), there are specific checklists for different taxa groups: Eukaryotes (EU), Bacteria and Archaea (BA), Plasmids (PL), Viruses (VI), and Organelles (ORG). Similarly, for marker gene sequences (Minimum Information about a MARKer gene Sequence, MIMARKS [3]), there are two checklists: Surveys (SU), which comprise sampling directly from environmental samples, and Specimens (SP), which are directly from cultured samples. There are ten mandatory terms that span all checklists: project name, sample name, taxonomy ID of DNA sample, geographic location (latitude and longitude), geographic location (country and/or sea, region), collection date, broad-scale environmental context, local environmental context, environmental medium, and sequencing method (Table 1). Beyond these ten mandatory terms, the type-specific checklists contain additional terms that have been developed in coordination with the corresponding research community.
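As a minimal sketch of how the ten shared mandatory terms might be checked before submission, the following Python snippet reports which of them are absent from a sample record. The short names in `MANDATORY_TERMS` are our assumption of the structured comment names for these terms and should be verified against the MIxS documentation; the function name and the example record are hypothetical.

```python
# Assumed structured comment names for the ten mandatory terms shared by all
# checklists; verify these against the current MIxS release before relying on them.
MANDATORY_TERMS = (
    "project_name", "samp_name", "samp_taxon_id", "lat_lon", "geo_loc_name",
    "collection_date", "env_broad_scale", "env_local_scale", "env_medium",
    "seq_meth",
)


def missing_mandatory_terms(record: dict) -> list:
    """Return the mandatory terms that are absent or empty in a sample record."""
    missing = []
    for term in MANDATORY_TERMS:
        value = record.get(term)
        if value is None or not str(value).strip():
            missing.append(term)
    return missing


# Hypothetical, deliberately incomplete record for illustration only.
record = {
    "project_name": "example project",
    "samp_name": "example-sample-1",
    "samp_taxon_id": "example taxon",
    "lat_lon": "0.0 0.0",
    "geo_loc_name": "example country",
    "env_broad_scale": "terrestrial biome [ENVO:00000446]",
    "env_local_scale": "area of evergreen forest [ENVO:01000843]",
    "env_medium": "soil [ENVO:00001998]",
}
print(missing_mandatory_terms(record))
```

A check of this kind only confirms presence; it does not validate the format or content of each value.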

4 Extensions Describe Sample and Sampling Contexts

Extensions (previously referred to as “environmental packages”) are collections of terms describing the specific environment, host, or context for a biological sample and are developed with domain experts (https://genomicsstandardsconsortium.github.io/mixs/Extension/). Extensions supplement checklists by providing additional terms to elaborate the context of the sample and/or sampling event (Table 2). For example, the soil extension has a number of terms to record attributes specific to soil environments, including soil depth (MIXS:0000018) for the measured vertical distance below the surface at which a sample was collected and FAO class (MIXS:0001083) for soil classification from the FAO World Reference Database for Soil Resources [22]. Similarly, environment- and study-specific terms like “history/agrochemical additions” (MIXS:0000639) are important for experimental designs in which fertilizers or other agrochemicals are applied to the field site. Extensions are used in conjunction with checklists, and together they form a “Combination.” For example, if a researcher generates metagenome sequence data from a soil environment, the appropriate combination would be MIMS and the soil extension. A detailed description of this combination is provided below in the section A Primer on Using MIxS: The MIMS Checklist and Soil Extension (Subheading 8).

5 Use of Ontologies and Value Sets

The use of ontologies in MIxS supports the standardization of terms, allowing different datasets to be combined and compared (an example is provided below in the section, A Primer on Using MIxS: The MIMS Checklist and Soil Extension (Subheading 8)). Ontologies also allow the submitter to describe values at the appropriate level of granularity. To standardize the use of categorical values, MIxS makes use of both ontologies and value sets for some terms (Table 3). For example, “host body site” (MIXS:0000867) can take values that are terms from the Uberon multi-species anatomy ontology. In some cases where there is no standardized ontology, a value set, i.e., a small set of enumerated values, is provided as an option. As another example, the term “host cellular location” (MIXS:0001313) takes a value set/enumeration that restricts the permissible values of “host cellular location” to “extracellular,” “intracellular,” or “not determined.” In the future, these value sets may be mapped to ontologies.
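A value set of this kind maps naturally onto an enumeration. The sketch below, which is our own illustration and not MIxS code, encodes the three permissible values given above for “host cellular location” and rejects anything else; the class and function names are hypothetical.

```python
from enum import Enum


class HostCellularLocation(Enum):
    """Permissible values for 'host cellular location' (MIXS:0001313), as listed in the text."""
    EXTRACELLULAR = "extracellular"
    INTRACELLULAR = "intracellular"
    NOT_DETERMINED = "not determined"


def parse_host_cellular_location(value: str) -> HostCellularLocation:
    """Return the matching enum member, or raise ValueError for any other string."""
    return HostCellularLocation(value.strip())


print(parse_host_cellular_location("intracellular"))

try:
    parse_host_cellular_location("nuclear")  # not in the value set
except ValueError as err:
    print("rejected:", err)
```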

Table 3 Examples of ontologies used in MIxS. A set of illustrative examples (not comprehensive) for ontologies used in MIxS together with example terms that demonstrate their usage and example values

When ontology term values are provided in MIxS, the standard requires that these be written using “termLabel [termID]” syntax, where the label is followed by the unique identifier in square brackets. This allows for both human readability as well as the best-practice use of identifiers. All ontology identifiers are prefixed identifiers (also known as CURIEs [18]), with the prefix registered in the bioregistry [23], and most of the ontologies belong to the Open Biological and Biomedical Ontology (OBO) Foundry [24]. MIxS uses ontologies that are openly available and can be browsed in standard ontology web portals such as BioPortal [25], OLS [26], or OntoBee [27]. When browsing these terms using these standard browsers, it is possible to see terms in the context of other terms, alongside their textual definitions, which makes it easier to select the correct term.
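The “termLabel [termID]” syntax can be split into its two parts with a short regular expression. The pattern below is a simplified assumption made for illustration; it is not the pattern defined in the MIxS schema, and the function name is hypothetical.

```python
import re

# Simplified pattern (an assumption): a label, optional whitespace, then a
# prefixed identifier in square brackets at the end of the string.
TERM_PATTERN = re.compile(
    r"^(?P<label>[^\[\]]+?)\s*\[(?P<term_id>[A-Za-z][\w.]*:\w+)\]$"
)


def parse_ontology_value(value: str) -> tuple:
    """Split 'termLabel [termID]' into (label, term_id); raise ValueError otherwise."""
    match = TERM_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not in 'termLabel [termID]' form: {value!r}")
    return match["label"], match["term_id"]


print(parse_ontology_value("soil [ENVO:00001998]"))
print(parse_ontology_value("terrestrial biome [ENVO:00000446]"))
```

Separating the label from the identifier in this way lets downstream analyses compare samples on the identifier alone, which is more robust than comparing free-text labels.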

6 MIxS Versions

MIxS is updated as new terms are suggested by domain experts or errors are found in the current version. MIxS follows the three-part Semantic Versioning practice [28], in the format X.Y.Z, where X is the major version, Y is the minor version, and Z is the patch version. A major version is released whenever new checklists, extensions, or terms are added. This requires approval from the GSC Board and major external stakeholders, such as INSDC and the Genomes OnLine Database for subsequent adoption [17]. The target frequency for major updates is roughly once per year. A minor version is released when errors have been fixed or refinements have been made without adding new checklists or extensions. Minor versions may include updates to terms or new terms that do not break any existing use of checklists and extensions. Minor version updates require review from the GSC Compliance and Interoperability Group. Patch versions are released when infrastructural changes are made to the MIxS code repository or to fix grammatical or spelling errors without making functional changes to MIxS content. Patches require review from the GSC Technical Working Group. The MIxS standard has moved to being hosted on GitHub, allowing full tracking of changes and the ability to easily retrieve older versions.
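To make the X.Y.Z scheme concrete, the following sketch parses two version strings and reports which part changed between them. It illustrates Semantic Versioning in general and is not GSC release tooling; the function names are hypothetical, and version numbers other than 6.2.0 are used purely as examples.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse 'X.Y.Z' (an optional leading 'v' is tolerated, as in Git tags)."""
    parts = text.lstrip("v").split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not an X.Y.Z version: {text!r}")
    return Version(*map(int, parts))


def release_type(old: str, new: str) -> str:
    """Classify the step from old to new as a 'major', 'minor', or 'patch' release."""
    a, b = parse_version(old), parse_version(new)
    if b <= a:
        raise ValueError("new version must be greater than old version")
    if b.major != a.major:
        return "major"
    if b.minor != a.minor:
        return "minor"
    return "patch"


print(release_type("v6.2.0", "6.2.1"))
print(release_type("6.2.0", "7.0.0"))
```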

7 Methods

7.1 How to Access the MIxS Standard

There are multiple ways to explore the MIxS standard. The web-based documentation at https://w3id.org/mixs has been optimized for exploring the large number of MIxS terms, checklists, and extensions. Recently, in collaboration with the National Microbiome Data Collaborative (NMDC [16]), the GSC updated the underlying representation of MIxS to use the Linked Data Modeling Language (LinkML [29]), which is expressed in the YAML format. This YAML representation allows MIxS to be automatically converted to different formats via the LinkML library for use in different tools by computational users. Resources in various data modeling frameworks (JSON-LD, JSON Schema, OWL, SQL) are provided in the MIxS GitHub repository, specifically at https://github.com/GenomicsStandardsConsortium/mixs/tree/v6.2.0/project. For example, there is a JSON Schema version of MIxS that allows data in JSON format to be validated. There are also semantic web representations, such as a Web Ontology Language (OWL) representation, which can be used in ontology browsers or editors like Protege. Because MIxS is written in LinkML, other technical representations could be added in the future.

As described previously, all MIxS terms are assigned a resolvable, globally unique persistent identifier that resolves to a page with full details about the term, including (1) both a structured comment name and a title; (2) a description of what the term represents and how it should be used; (3) which checklists and extensions the term can be used with; (4) the allowed values for that term; and (5) additional information of interest. An example is shown in Fig. 1.

Fig. 1 Screenshot of the MIxS documentation for the MIMS checklist and soil extension (“MimsSoil,” https://genomicsstandardsconsortium.github.io/mixs/MimsSoil/) showing (a) a section of the combination “MimsSoil” documentation with a subset of applicable terms and (b) an individual term page that includes the term description, URI, and properties

Across the INSDC, the MIxS standard is made available in different forms. The ENA provides XML downloads for all extensions (https://www.ebi.ac.uk/ena/browser/checklists), and the NCBI BioSample database provides both XML and Excel downloads of each extension and checklist combination (https://www.ncbi.nlm.nih.gov/biosample/docs/packages/). Further details for INSDC data submission are provided below.

7.2 How to Use the MIxS Standard for Data Use, Reuse, and Analysis

Standards are necessary to ensure data are interoperable and can be combined with data from other sources. By providing a standard set of data descriptor terms, together with constraints on how these can be used, MIxS allows different genomic datasets from multiple sources, including microbiome datasets, to be combined in meta-analyses. The ways in which MIxS is used vary depending on the database implementing it. Some resources, such as the NMDC Data Portal [16], allow for faceted search using a selected subset of MIxS terms (Fig. 2).

Fig. 2 Data discovery using MIxS terms. An example of metadata search using the MIxS term “depth” on the NMDC Data Portal (https://data.microbiomedata.org/). In the NMDC Data Portal, MIxS terms act as facets for searching across integrated multi-omics studies and datasets

When downloading biosample data in bulk, MIxS terms may appear as column headers in tabular data downloads, as XML elements (NCBI BioSample), or JSON objects (NMDC). Note that additional processing may be required to make data comparable. Most MIxS measurement fields are string values containing both a numeric value and a unit (and in some cases, ranges may be allowed). These may need to be parsed before quantitative analysis. MIxS does not currently mandate the use of any one unit for measurement fields such as “depth,” which means that one study may measure depth in centimeters and another in meters, so it may be necessary to do basic unit conversions.
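As a minimal sketch of the parsing and unit conversion described above, the following function normalizes a depth string to meters. The accepted unit spellings and the conversion table are assumptions made for illustration, and ranges are deliberately not handled (they raise an error so that they can be reviewed case by case).

```python
import re

# Conversion factors to meters; the accepted unit spellings are assumptions.
TO_METERS = {
    "m": 1.0, "meter": 1.0, "meters": 1.0,
    "cm": 0.01, "centimeter": 0.01, "centimeters": 0.01,
    "mm": 0.001, "millimeter": 0.001, "millimeters": 0.001,
}

# A single number followed by a unit, e.g. "10 cm" or "0.3m".
MEASUREMENT = re.compile(
    r"^\s*(?P<value>[-+]?\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)\s*$"
)


def depth_in_meters(raw: str) -> float:
    """Parse a 'value unit' string and return the value converted to meters."""
    match = MEASUREMENT.match(raw)
    if match is None:
        raise ValueError(f"cannot parse measurement: {raw!r}")
    unit = match["unit"].lower()
    if unit not in TO_METERS:
        raise ValueError(f"unsupported unit: {unit!r}")
    return float(match["value"]) * TO_METERS[unit]


for raw in ("10 cm", "0.3 m", "250 mm"):
    print(raw, "->", depth_in_meters(raw), "m")
```

Raising an error for anything that does not match, rather than guessing, keeps ambiguous values visible to the analyst.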

Many databases do not enforce all MIxS constraints, or they may have data that predates certain constraints. This means that some fields may have erroneous or ambiguous values. Care should be taken with any analysis, and decisions on how to clean or normalize the data need to be made on a case-by-case basis. An additional issue is data sparsity: although MIxS contains many terms to describe various aspects of a sample, in practice, very few are commonly used, and thus analyses will need to handle missing data appropriately. As awareness of MIxS increases and data validation tools such as the NMDC Submission Portal (https://data.microbiomedata.org/submission/home) become more widespread, the quality and completeness of sample metadata should increase, enabling more powerful meta-analyses.
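One simple way to gauge data sparsity before an analysis is to compute, for each term of interest, the fraction of records that carry a usable value. The sketch below assumes that records are available as Python dictionaries; the set of placeholder strings treated as missing is our assumption and should be adapted to the repository being analyzed. The records shown are hypothetical.

```python
# Placeholder strings treated as missing; this list is an assumption.
MISSING_VALUES = {"", "missing", "not collected", "not provided", "not applicable"}


def is_missing(value) -> bool:
    """True if the value is absent or is a known placeholder for missing data."""
    return value is None or str(value).strip().lower() in MISSING_VALUES


def term_completeness(records: list, terms: list) -> dict:
    """Return, for each term, the fraction of records with a usable value."""
    total = len(records)
    if total == 0:
        return {term: 0.0 for term in terms}
    return {
        term: sum(not is_missing(r.get(term)) for r in records) / total
        for term in terms
    }


# Hypothetical records for illustration only.
records = [
    {"depth": "10 cm", "ph": "7.2"},
    {"depth": "not collected"},
    {"depth": "0.3 m", "ph": None},
    {"depth": "1 m"},
]
print(term_completeness(records, ["depth", "ph"]))
```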

7.3 How to Use the MIxS Standard for Data Submission

The GSC works across public databases and primary repositories, namely the International Nucleotide Sequence Database Collaboration (INSDC, https://www.insdc.org/ [15]). The INSDC comprises the DNA Data Bank of Japan (DDBJ) at the National Institute of Genetics in Mishima, Japan; the European Nucleotide Archive (ENA) at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) in Hinxton, UK; and GenBank at the National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health in Bethesda, Maryland, USA. The INSDC integrates the MIxS standard by updating its nucleotide sequence and BioSample resources (NCBI Packages: https://www.ncbi.nlm.nih.gov/biosample/docs/packages/; EBI BioSamples: https://www.ebi.ac.uk/biosamples/; EBI-ENA standards: https://www.ebi.ac.uk/ena/browser/about/data-standards; DDBJ: https://www.ddbj.nig.ac.jp/biosample/submission-e.html) to use the latest major release while ensuring backward compatibility with previous MIxS versions, and by providing MIxS-compliant metadata templates (NCBI: https://submit.ncbi.nlm.nih.gov/biosample/template/; ENA Sample checklists: https://www.ebi.ac.uk/ena/browser/checklists) that enable the selection of GSC checklists and extensions.

Form-based interfaces are used with different implementations across the INSDC repositories to collect information about samples using MIxS. Typically, the submitter is asked to provide both checklist and extension, and the combination of these determines which terms are provided, and what the constraints on these terms are. Data submission to EBI can be made via WebIn (https://www.ebi.ac.uk/ena/submit/webin/), while NCBI offers a different online submission tool (https://submit.ncbi.nlm.nih.gov/biosample/template/). DDBJ also offers the MIxS checklists as pre-formatted template files to be uploaded to their submission portal D-Way (https://ddbj.nig.ac.jp/D-way/). Similarly, for other databases such as GOLD, the implementation is through a web interface where a submitter chooses a MIxS checklist and extension. These different implementations are supported by expert curators to validate terms and ensure compliance. Alternatively, the NMDC uses a specialized data submission tool called DataHarmonizer [30], which provides real-time validation to users and aims to lower barriers for metadata submission (Fig. 3).

Fig. 3 Screenshot of the NMDC Submission Portal (https://data.microbiomedata.org/submission/home), which uses DataHarmonizer [30]. EnvO terms are shown as dropdowns for value sets along with validation checks for terms with measurement fields

The different submission systems available can benefit the larger research ecosystem by providing resources via familiar settings and user interfaces for a variety of different users. However, the multiple interfaces may also present challenges for researchers who may not be familiar with the MIxS standard, as they offer multiple interpretations of the same terms. Regardless of the method used for submitting metadata, it is useful for researchers to be aware of the MIxS standard to help prospectively record the required information. For example, when carrying out a study involving soil biogeochemical analysis, we recommend reviewing the MIxS soil extension (see below) to ensure that all measurements can be mapped to a term and to plan for capturing these measurements prospectively, so they can be easily included as part of a submission.

7.4 How to Specify Sample Environments Using the EnvO Ecosystem Classification

The MIxS standard uses many different ontologies for different terms, as described previously. For environmental samples, the key ontology is the Environment Ontology (EnvO [4]), a community-led domain ontology that represents diverse environments and aims to promote standardization and interoperability through concise, controlled descriptions of environment types across several levels of granularity. It also ensures that datasets described using EnvO terms can be more easily integrated and analyzed in a reproducible manner. Since the meanings of the terms are precisely defined and accessible, humans and computers can easily connect EnvO terms across datasets. EnvO also serves as a bridge to other standards and vocabularies in the environmental sciences, including mappings to the SWEET vocabulary [31].

MIxS draws on EnvO for three mandatory terms, present in all extensions, that specify the biome, environmental feature, and environmental material; together they are colloquially referred to as the “EnvO triad.” These three terms are described as follows:

  • Broad-scale environmental context (MIXS:0000012): The major environmental system (e.g., EnvO’s biome) from which the sample or specimen was derived. The biome identified should be coarse-grained, that is, the broadest general environment from which the sample was taken. For example, the terrestrial biome is defined as “a biome which is primarily or completely situated on a landmass,” ENVO:00000446.

  • Local environmental context (MIXS:0000013): A more direct expression of the sample or specimen’s local vicinity, which likely has a significant influence on the sample or specimen. Taking the above terrestrial biome sample, a local environmental context could be an area of evergreen forest, which is defined as “an area of the planet's surface which is primarily covered by a forest in which the majority of trees maintain their foliage despite seasonal change,” ENVO:01000843.

  • Environmental medium (MIXS:0000014): The environmental material(s) immediately surrounding your sample or specimen prior to sampling. Subclasses within EnvO’s environmental material class (http://purl.obolibrary.org/obo/ENVO_00010483) should be used as values for this term. Using the previous example, a soil sample collected from an evergreen forest would simply use the environmental medium soil, defined as “environmental material which is primarily composed of minerals, varying proportions of sand, silt, and clay, organic material such as humus, interstitial gases, liquids, and a broad range of resident micro- and macroorganisms,” ENVO:00001998.

For host-associated samples, terms from a relevant anatomy ontology (UBERON for animals and PO for plants) can be used for the local environmental context. EnvO provides a detailed description with usage notes and general considerations (https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS) to further guide researchers.
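The evergreen forest soil example used in the three term descriptions above can be written out as a small record in the “termLabel [termID]” syntax. In this sketch, the dictionary keys are our assumption of the structured comment names for the three terms, and the helper functions are hypothetical; the identifier-to-URL rule follows the pattern of the EnvO link given above (http://purl.obolibrary.org/obo/ENVO_00010483).

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def envo_value(label: str, term_id: str) -> str:
    """Format an ontology value in 'termLabel [termID]' syntax."""
    return f"{label} [{term_id}]"


def obo_purl(term_id: str) -> str:
    """Turn a CURIE such as 'ENVO:00001998' into an OBO-style URL."""
    prefix, sep, local_id = term_id.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {term_id!r}")
    return f"{OBO_BASE}{prefix}_{local_id}"


# Keys are assumed structured comment names for the three triad terms.
triad = {
    "env_broad_scale": envo_value("terrestrial biome", "ENVO:00000446"),
    "env_local_scale": envo_value("area of evergreen forest", "ENVO:01000843"),
    "env_medium": envo_value("soil", "ENVO:00001998"),
}

for name, value in triad.items():
    print(f"{name}: {value}")
print(obo_purl("ENVO:00001998"))
```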

8 A Primer on Using MIxS: The MIMS Checklist and Soil Extension

To illustrate the usage of MIxS, we outline below an example sample from NSF's NEON soil collection (https://data.neonscience.org/data-products/DP1.10107.001) that was sequenced through the Department of Energy’s (DOE) Joint Genome Institute under Award DOI 10.46936/10.25585/60008738. Since this is a soil metagenome sample, the MIMS checklist is used along with the soil extension with the full set of terms in the combination (https://genomicsstandardsconsortium.github.io/mixs/MimsSoil/). The soil metagenome sample name is “Terrestrial soil microbial communities from Great Basin, Onaqui, Utah, USA - ONAQ_008-M-20210524-comp-1.” This example is available through NCBI’s BioSample resource (https://www.ncbi.nlm.nih.gov/biosample/SAMN37862680), along with records available through the NMDC Data Portal (https://data.microbiomedata.org/details/sample/nmdc:bsm-11-357mga60) and GOLD (https://gold.jgi.doe.gov/biosamples?id=Gb0356145). Figure 4 shows the soil metagenome example deposited in NCBI’s BioSample repository and how metadata has been populated to conform to the MIxS standard.

Fig. 4 The soil metagenome sample “Terrestrial soil microbial communities from Great Basin, Onaqui, Utah, USA - ONAQ_008-M-20210524-comp-1” deposited in NCBI’s BioSample repository with MIxS-compliant metadata using the combination of the MIMS checklist and soil extension

The MIMS checklist, together with the soil extension, contains a combined total of 97 terms to describe the soil environment and context for a given sample (https://github.com/GenomicsStandardsConsortium/mixs/releases/tag/v6.2.0). For each of these 97 terms, the cardinality indicates whether the term is mandatory, recommended, or optional, following the LinkML documentation on slot cardinality (https://linkml.io/linkml/schemas/slots.html#slot-cardinality). Accordingly, Table 4 provides the mandatory and recommended terms for this example, although we note that not all recommended terms have been submitted. Optional terms that capture additional metadata, such as pH, soil horizon, and water content, are also shown. Term descriptions and formatting guidelines are provided at https://genomicsstandardsconsortium.github.io/mixs/MimsSoil/, and adhering to this standard ensures that metadata are interoperable and machine-readable.

Table 4 The soil metagenome sample “Terrestrial soil microbial communities from Great Basin, Onaqui, Utah, USA - ONAQ_008-M-20210524-comp-1” with MIxS-compliant metadata using the combination of the MIMS checklist and soil extension. Mandatory terms are indicated with the cardinality 1..1, while recommended terms are indicated with 0..1

As previously mentioned, the use of ontologies in MIxS supports the standardization of terms and can be leveraged for comparative (meta)genome analyses. An example using the MIMS checklist and soil extension is demonstrated in the NMDC Data Portal with the term environmental medium (MIXS:0000014), which can be used to identify samples across diverse soil types (Fig. 5). In this example, samples from pasture soil (ENVO:00005773), tropical soil (ENVO:00005778), meadow soil (ENVO:00005761), grassland soil (ENVO:00005750), and alpine soil (ENVO:00005741) can be identified and selected for comparative analyses across four separate studies and 187 biosample records. Using these EnvO terms, researchers have the ability to search and access data across diverse soil types for downstream meta-analyses of these metagenomes.
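Outside of a portal, the same kind of selection can be made on downloaded records by matching the identifier in the environmental medium field. The sketch below uses the five EnvO soil terms named above; the record structure, the `env_medium` key, and the sample records are hypothetical and do not reflect actual NMDC data.

```python
import re
from typing import Optional

# The five EnvO soil terms named in the text.
SOIL_MEDIA = {
    "ENVO:00005773": "pasture soil",
    "ENVO:00005778": "tropical soil",
    "ENVO:00005761": "meadow soil",
    "ENVO:00005750": "grassland soil",
    "ENVO:00005741": "alpine soil",
}

# Identifier in square brackets at the end of a 'termLabel [termID]' value.
ID_IN_BRACKETS = re.compile(r"\[([A-Za-z]+:\d+)\]\s*$")


def medium_id(value: Optional[str]) -> Optional[str]:
    """Extract the identifier from a 'termLabel [termID]' value, if present."""
    match = ID_IN_BRACKETS.search(value or "")
    return match.group(1) if match else None


def select_by_medium(records: list, wanted_ids) -> list:
    """Keep records whose environmental medium identifier is in wanted_ids."""
    return [r for r in records if medium_id(r.get("env_medium")) in wanted_ids]


# Hypothetical records for illustration only.
records = [
    {"id": "sample-A", "study": "study-1", "env_medium": "pasture soil [ENVO:00005773]"},
    {"id": "sample-B", "study": "study-2", "env_medium": "alpine soil [ENVO:00005741]"},
    {"id": "sample-C", "study": "study-2", "env_medium": "soil [ENVO:00001998]"},
    {"id": "sample-D", "study": "study-3"},
]
selected = select_by_medium(records, SOIL_MEDIA)
print([r["id"] for r in selected])
```

Note that this exact-identifier match does not include samples annotated with the more general term soil (ENVO:00001998); including broader or narrower terms would require consulting the ontology hierarchy.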

Fig. 5 An example of how using the Environment Ontology (EnvO) can support comparative analysis. Using the environmental medium (MIXS:0000014) term in the NMDC Data Portal, researchers can search across diverse soil types like pasture soil (ENVO:00005773), tropical soil (ENVO:00005778), meadow soil (ENVO:00005761), grassland soil (ENVO:00005750), and alpine soil (ENVO:00005741). These search results identify four separate studies that could be used for a comparative metagenome analysis

9 Discussion

9.1 How to Contribute to Future Development of the MIxS Standard

The GSC engages researchers from around the globe and welcomes feedback from the community on every aspect of the MIxS standard. Researchers can directly submit GitHub tickets to the GSC GitHub issue tracker (https://github.com/GenomicsStandardsConsortium/mixs/issues) to propose changes or request new checklists, extensions, or terms; to correct errors in the standard; or to ask questions on how to apply MIxS.

Although an individual may submit requests for new or updated MIxS checklists or extensions, additions or updates are generally created by a community working within a specific research area or with a specific type of genetic or genomic data. Sometimes, a community may simply need to add new terms to an existing extension or checklist, or they may need to create an entirely new one that reuses some existing terms but requires many new terms to cover a new topic area. Often, research communities reach out to the GSC after having identified a set of metadata terms they need, but we encourage them to reach out to the GSC as early as possible to co-develop the expansion. To do so, community representative(s) should contact the GSC Compliance and Interoperability Group to set up an initial consultation. This can be done via the MIxS GitHub repository issue tracker or by emailing directly at gensc-cig@googlegroups.com. The GSC will guide the community through the process, but, in general, it will involve the following steps:

  1. Complete a project proposal using the template (https://www.gensc.org/pages/projects/gsc-project-description-template.html).

  2. Select any appropriate existing MIxS terms from the extensive catalog of terms already defined (https://genomicsstandardsconsortium.github.io/mixs/term_list/).

  3. Propose any new terms required by the community for the new checklist or extension using the GitHub issue tracker template (https://github.com/GenomicsStandardsConsortium/mixs/issues/new?template=term-request.md).

  4. If any existing terms need refinement of their examples, required or recommended rules, or comments for use in the new project, those changes should also be submitted as GitHub issues.

  5. The GSC’s Compliance and Interoperability Group will review requests and liaise with the extended community to ensure that there is a legitimate need for the new checklist or extension and that terms are appropriately defined with suitable expected values.

  6. Once consensus is reached between the community representatives and the GSC, the new terms will be incorporated into a MIxS release candidate for review by the GSC, repositories that implement GSC standards, and the general public.

  7. New checklists, extensions, and terms only become officially part of MIxS once they are approved as part of a major release. Communities may begin to use them and their identifiers prior to the release, but must be aware that they are subject to change until approved by the broader GSC community.

Because MIxS is a community standard, the GSC is committed to incorporating feedback and providing updates that meet community needs. The NMDC has helped facilitate these updates through user research. Following discussion with subject matter experts, some terms, such as “climate environment” (MIXS:0001040), have been identified as ambiguous or redundant and will be deprecated in a future version of MIxS. Concerns have also been raised about the widespread tolerance of open-ended units for some terms, such as depth (MIXS:0000018). Based on this feedback, future MIxS releases will require values for some terms to be reported in specific units (for example, depth will be required in meters). This change will improve interoperability and machine readability, ensure consistent capture, and reduce confusion.
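
To illustrate why a fixed unit helps machine readability, the following minimal Python sketch shows one way a data submitter might normalize free-form depth values to meters before submission. The function name `depth_to_meters`, the accepted unit spellings, and the conversion table are assumptions made for this example; they are not defined by MIxS or by any GSC tool.

```python
import re

# Hypothetical conversion factors to meters. This table is an assumption
# for illustration and is not part of the MIxS specification.
TO_METERS = {
    "m": 1.0,
    "meter": 1.0,
    "meters": 1.0,
    "cm": 0.01,
    "mm": 0.001,
    "km": 1000.0,
    "ft": 0.3048,
    "feet": 0.3048,
}

# A number (optionally signed, optionally decimal) followed by a unit word.
_DEPTH_PATTERN = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def depth_to_meters(value: str) -> float:
    """Convert a string such as '250 cm' to a depth in meters."""
    match = _DEPTH_PATTERN.match(value)
    if match is None:
        raise ValueError(f"Cannot parse depth value: {value!r}")
    number, unit = match.groups()
    factor = TO_METERS.get(unit.lower())
    if factor is None:
        raise ValueError(f"Unsupported depth unit: {unit!r}")
    return float(number) * factor


if __name__ == "__main__":
    for raw in ["10 m", "250 cm", "100 ft"]:
        print(raw, "->", depth_to_meters(raw), "m")
```

With open-ended units, every downstream consumer needs a conversion step like this one and must guess at unit spellings. Requiring meters at submission time removes that ambiguity at the source.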

Outreach and community collaboration are supported through the GSC annual in-person meetings, which facilitate discussions among GSC board members, event attendees, and local researchers. These meetings consist of updates from the GSC and its working groups, talks centered around implementing standards, and workshops aimed at promoting practical skills for adopting best practices in standards. Themes for each annual meeting are devised to facilitate discussion of, and solutions to, emerging data standards needs. Toward that aim, the GSC rotates the annual meetings among different regions of the world, providing opportunities for engagement between the GSC and diverse local researchers and students.

Community members are encouraged to join two GSC-led working groups. The Compliance and Interoperability Group meets virtually once a month to discuss proposed changes to MIxS checklists, extensions, or specific terms, with a focus on biological topics. The Technical Working Group meets twice a month and focuses on the technical implementation of the GSC standards and supporting software development, including LinkML and ontologies. Both working groups are open to new participants, regardless of familiarity with the GSC, technical expertise, or level of continued participation. The GSC Google Group (https://groups.google.com/u/0/g/genomic-standards-consortium/about) is maintained for GSC members and the larger community; it distributes GSC-related emails, provides information on upcoming meetings, and offers a place for discussion of standards and GSC activities. To request to join one of the working groups, please send a message to the GSC Google Group.

The GSC continues to emphasize community engagement and educational initiatives across genomic research communities. Participation in national and international genomic research conferences (Intelligent Systems for Molecular Biology [ISMB], International Society for Microbial Ecology [ISME], American Society for Microbiology [ASM]) provides opportunities to connect with diverse genomic researchers. The breadth of environments studied and groups performing genome sequencing has continued to grow over the past two decades, and the GSC strives to engage new communities to support their growing metadata standardization needs and to facilitate ever greater reuse and discoverability of genomic datasets.

9.2 Beyond Sequence Data Standards: Metabolomics and Proteomics

Community-supported consortia have formed around developing standards for additional data types beyond (meta)genomics data. For example, researchers published several articles on establishing standard reporting requirements for metabolomics data in a formative volume of the journal Metabolomics [32], and the Human Proteome Organization (HUPO) developed a number of modules (https://www.psidev.info/miape) for reporting the minimum information about a proteomics experiment [33]. However, in the years since their development, the challenges associated with maintaining these standards and supporting community adoption have become clear. The Metabolomics Standards Initiative’s ontology and reporting standard (https://github.com/MSI-Metabolomics-Standards-Initiative) are not regularly updated, and analysis of data repositories revealed poor compliance with the established metabolomics standards [34, 35]. Similarly, HUPO’s proteomics standards have not been updated for a number of years.

The GSC has engaged members of the metabolomics community to begin exploring potential improvements to the standard terms, ontologies, and reporting formats that could be implemented to encourage broader adoption, particularly among the growing community of researchers generating metabolomics and genomics data from the same sample (i.e., multi-omics). In practice, the convention established by the GSC will be followed, such that a researcher would combine an extension and a checklist to provide context about their sample and the descriptors and standard terms necessary for a metabolomics experiment, respectively. The metabolomics checklist(s) will leverage previous Metabolomics Standards Initiative efforts and will encompass descriptors for sample processing (e.g., solvent extraction, derivatization), instrument analysis (e.g., chromatography separation, ionization source), and data processing (e.g., normalization, peak picking).

9.3 Partnerships and Alignment with Other Standards

The GSC is committed to partnering with other programs and initiatives to further the reach of the MIxS standard and lower barriers to adoption. The GSC has partnered with the NMDC to host training events and communicate the value of microbiome data standards. As part of this collaboration, the NMDC provides extensive MIxS and data standards training to the annual cohort of NMDC Ambassadors, who are then tasked with hosting their own workshops and events to distribute this information and provide hands-on experiences with MIxS for microbiome data [36]. The GSC is planning to extend these activities with additional partners, such as the NFDI4Microbiota consortium (https://nfdi4microbiota.de/), which also offers a range of training courses relating to FAIR microbiome data generation, processing, and deposition. The GSC also meets regularly with the human biomedical-centric organization Global Alliance for Genomics and Health (GA4GH, https://www.ga4gh.org/) to ensure that our work on different standards is aligned and non-redundant.

Because MIxS has a broad scope, covering sequencing and other omics techniques, sample preparation, and sample sources from natural environments, human and animal subjects, experiments, and manufactured products, it naturally overlaps with other standards in domains such as genomics, environmental science, and biodiversity. The GSC’s strategy is to provide robust curated mappings between these standards using frameworks such as the Simple Standard for Sharing Ontological Mappings (SSSOM) [37]. This strategy is exemplified by recent work to align MIxS with standards used in biodiversity informatics. The leading standards body in that field is the Biodiversity Information Standards Group (TDWG, https://www.tdwg.org/), which produces the Darwin Core standard [38] used by the biodiversity community in databases such as the Global Biodiversity Information Facility (GBIF, https://www.gbif.org/). The GSC has signed a memorandum of understanding with TDWG that commits both groups to maintaining a shared mapping between their vocabularies (MIxS and Darwin Core) [39].
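
As a rough illustration of what an SSSOM-style mapping looks like in practice, the Python sketch below builds a small tab-separated mapping table and loads it into a lookup keyed by MIxS term identifier. The column names follow the SSSOM convention (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`), but the rows themselves are assumptions chosen for illustration: the choice of Darwin Core targets and of the `skos:relatedMatch` predicate does not reproduce the curated MIxS to Darwin Core mapping, and `load_mappings` is a hypothetical helper.

```python
import csv
from collections import defaultdict

# Illustrative rows only. The predicates and Darwin Core targets below are
# assumptions for this example, not the curated GSC/TDWG mapping.
ROWS = [
    ("subject_id", "subject_label", "predicate_id", "object_id",
     "mapping_justification"),
    ("MIXS:0000018", "depth", "skos:relatedMatch", "dwc:verbatimDepth",
     "semapv:ManualMappingCuration"),
    ("MIXS:0000018", "depth", "skos:relatedMatch", "dwc:minimumDepthInMeters",
     "semapv:ManualMappingCuration"),
]

# SSSOM TSV files may carry metadata in leading comment lines starting with '#'.
SSSOM_TSV = "# illustrative mapping set\n" + "\n".join(
    "\t".join(row) for row in ROWS
) + "\n"


def load_mappings(text: str) -> dict:
    """Return {subject_id: [(predicate_id, object_id), ...]} from SSSOM-style TSV."""
    # Skip blank lines and the commented metadata block.
    data_lines = [
        line for line in text.splitlines()
        if line and not line.startswith("#")
    ]
    reader = csv.DictReader(data_lines, delimiter="\t")
    mappings = defaultdict(list)
    for row in reader:
        mappings[row["subject_id"]].append(
            (row["predicate_id"], row["object_id"])
        )
    return dict(mappings)


if __name__ == "__main__":
    for subject, targets in load_mappings(SSSOM_TSV).items():
        for predicate, obj in targets:
            print(subject, predicate, obj)
```

Recording the predicate and justification alongside each pair of identifiers lets a consumer distinguish exact equivalences from looser relationships. This matters when, as in the depth example, one MIxS term relates to more than one term in the other vocabulary.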

Another relevant standards body is the Global Alliance for Genomics and Health (GA4GH) [40], which publishes the Phenopackets standard for representing metadata about patients and genomics research subjects [41]. Although its emphasis is on healthcare and research, this standard relates to the MIxS host-associated extensions in many respects, including the representation of samples and their source tissues, the disease status of the source, and drug exposures or therapies. Alignment of MIxS with these and other clinical-domain standards has yet to commence but will be valuable for the global interoperability of genomics data.

10 Conclusion

This chapter provides a practical overview of the MIxS standard with the aim of supporting its future use and development for FAIR comparative (meta)genome analysis. The structure and terminology presented derive from the most recent MIxS version, 6.2, and will continue to evolve in future versions, as expected of a community-driven standard. We encourage the research community to continue working with the GSC and partner organizations like the NMDC to champion the use of standards to enable data discovery and research innovation.