CONTENTS

Introduction

1. Computing and Network Infrastructure and Data Storage

2. Software and Interface

3. Data Deposition

4. Analytical Functionality

5. NGID Model

Conclusions

INTRODUCTION

Genetic databases are a key element of scientific infrastructure: they provide the storage and analysis of genetic data for science and for various sectors of the economy. There are currently more than 1500 databases of genetic information in the world, including both specialized research databases and databases of genetic information of the population [1].

In accordance with the order of the President of the Russian Federation no. Pr-920 dated June 4, 2020, the National Research Center “Kurchatov Institute” is creating the National Genomic Information Database (NGID). In 2022, amendments were adopted to Federal Law no. 643-FZ dated December 29, 2022 “On State Regulation in the Field of Genetic Engineering Activities,” according to which, at the NGID, “information is necessarily provided by genetic data holders who carry out genetic-engineering activities, the production and/or supply of genetically modified organisms, production and/or supply of products obtained with the use of genetically modified organisms or containing such organisms, government agencies, other legal entities and individual entrepreneurs engaged in molecular genetic analysis for the purpose of conducting examinations, tests and research works” [2].

The NGID should solve the following tasks:

(i) storage of genetic information of a whole variety of BioSamples, including plants, animals, wildlife microorganisms and ecosystem metagenomes, plants and animals for agriculture, microorganisms for industry, humans, as well as pathogenic microorganisms;

(ii) classification of genetic information;

(iii) search (by metadata and by homology);

(iv) visualization and analysis of genetic data in an integrated genomic browser;

(v) access to high-performance and cloud computing infrastructure for processing and analyzing genetic data;

(vi) design of analysis tools (“pipelines”);

(vii) support for analysis tools based on machine-learning technologies;

(viii) a communication environment (social network) for professional communities;

(ix) support for the publication of scientific articles;

(x) integration with departmental systems in terms of providing access to genetic data;

(xi) integration with international databases;

(xii) accumulation of data obtained during the implementation of the Federal Scientific and Technical Program for the Development of Genetic Technologies for 2019–2027 (Fig. 1).

Fig. 1. Functional structure of the NGID.

It is planned that the NGID will become a key element of the genetic research and development infrastructure in Russia, providing the storage, integration, and analysis of genetic data obtained by Russian and foreign organizations and researchers.

1 COMPUTING AND NETWORK INFRASTRUCTURE AND DATA STORAGE

Computing Infrastructure

The software and hardware architecture of the NGID should provide acceptable speed for operations related to the search, visualization, analysis, and high-performance processing of genetic information using machine-learning technologies and supercomputing. To protect information, three information-processing circuits are defined in the NGID: open, confidential, and special. The software and hardware architectures of the circuits are similar, but differ where information-protection requirements demand it. The information and computing infrastructure of the NGID is based on the following solutions:

(i) a distributed storage system for working with genetic data;

(ii) a computing cluster for running applications and services based on containers;

(iii) a system for managing computing resources, data storage, and computing clusters;

(iv) a tape system for long-term storage.

The distributed storage system supports both a file system and object storage. Object storage is a nonhierarchical structure in which data is stored as objects and accessed through unique identifiers. This solution offers ample opportunities for storing metadata and organizing data access, and it scales efficiently. The object storage of the NGID will contain genetic data deposited by users, as well as data obtained during the import and exchange of information with external databases.

Cloud technology is used to manage computing resources, clusters, and data storage in the open and confidential circuits. This solution allows flexible management of the computing-cluster infrastructure, stronger isolation of information-processing segments, and load balancing between clusters. In the open and confidential circuits, the NGID uses an adapted environment for running containerized applications in the cloud infrastructure, certified for compliance with CNCF (Cloud Native Computing Foundation) standards and with information-protection requirements under applicable law. This environment will provide an extended application programming interface for tasks such as creating, configuring, and deleting disks, load balancing, managing external networks, and setting up security groups, which simplifies the maintenance of computing facilities. The total computing resources are estimated at 30 000 computing cores, based on resource recommendations for high-throughput sequencing and on time measurements for typical processing operations [3]. For analytical programs with increased requirements for computing resources, in particular calculations on graphics accelerators and operations that require a large amount of memory, a high-performance computing cluster is provided as part of the NGID open circuit. To keep data safe over long periods, a tape storage system will be used.

Supercomputer resources are used as the high-performance computing infrastructure. To store genetic data, the system includes two storage systems: one for I/O-intensive tasks and a distributed long-term storage system for the archive of genomic data, based on hard disk drives. In creating the computing infrastructure, the main emphasis is placed on ensuring the fault-tolerant operation of the NGID and reducing maintenance time. NGID infrastructure components at all levels use hot standby, and software applications are containerized [4].

Planned Volumes of Information Storage

The total size of the NGID storage is determined from the storage sizes of existing international databases. According to the European Bioinformatics Institute (EMBL-EBI), in 2021 the volume of stored data exceeded 390 PB and continues to increase [5]. Today, about 10 PB of open genetic data are published in the databases of the International Nucleotide Sequence Database Collaboration (INSDC), a comparable amount of closed data is reported to be stored, and several petabytes of new data are published annually [6]. Considering that the volume of publicly available genetic data that can be used by a wide range of researchers is much higher (for the Sequence Read Archive of the National Center for Biotechnology Information (SRA NCBI), it is about 45 PB) and that redundancy in data storage is needed, an NGID storage with a usable capacity of 50 PB is currently being considered [7].

An important feature of genetic data is the small average file size, ~1 MB. Thus, 1 PB of genetic data comprises about a billion files, which greatly exceeds the capabilities of modern storage systems. Therefore, the following measures were taken when developing the NGID: genetic data is combined into archives before being placed in the storage system, data services extract files from the archives directly, and an object storage system is used that can hold a large number of files without performance degradation.
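The file-count estimate and the archive approach described above can be illustrated with a minimal sketch. It assumes an uncompressed tar archive held in memory; the archive format and the function names `pack_files` and `read_member` are illustrative assumptions, not the NGID implementation.

```python
import io
import tarfile

# Back-of-the-envelope count: 1 PB of files averaging ~1 MB each.
PETABYTE = 10**15
AVERAGE_FILE_SIZE = 10**6
files_per_petabyte = PETABYTE // AVERAGE_FILE_SIZE  # about a billion files


def pack_files(files: dict[str, bytes]) -> bytes:
    """Combine many small files into one uncompressed tar archive (in memory)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_member(archive_bytes: bytes, name: str) -> bytes:
    """Read a single file from the archive without unpacking the others."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as archive:
        member = archive.extractfile(name)  # raises KeyError for unknown names
        if member is None:  # the entry exists but is not a regular file
            raise KeyError(name)
        return member.read()
```

With this layout the storage system holds one object per archive rather than one per sequence file, while a data service can still return an individual file on request.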

Data Model

To organize the data, the NGID uses a modified version of the approach implemented by the INSDC, one of the best-known global initiatives in the field of genetic-data exchange, which was formed in the early 1980s [6]. INSDC members have created a global, comprehensive public domain collection of nucleotide sequences and related metadata. The data range from raw reads through genomic assemblies and alignments to a variety of functional annotations. The regular exchange of data, standardized formats and, increasingly, the exchange of technology ensure global synchronicity in collaboration. Owing to these advantages, the INSDC metadata hierarchy and standards are also used in databases that are not part of the INSDC, in particular in the National Genomics Data Center (NGDC), China [8].

The top level of data organization is the BioProject, which simplifies the classification and systematization of data and metadata for all INSDC participants. A BioProject consists mainly of metadata about a research project; it makes it possible to combine large amounts of scientific information, simplify searching, increase accessibility for database users, and unite research participants in a single information context. The biosample database (BioSample) was developed to store descriptive information (metadata) about the biological samples from which the deposited genetic data were obtained [9]. The next (lower) level of organization in INSDC databases is the level of description of genetic data, i.e., metadata about genomic assemblies and sequencing data.

The NGID retains the levels of metadata presented in the INSDC databases and adds a new level of metadata: a biological object (bio-object) located in the hierarchy between the BioProject and the BioSample (Fig. 2). A bio-object makes it possible to link several BioSamples taken from one living organism or one cell culture, but under different conditions, at different times, or from different tissues. For example, a bio-object can be metadata on a specific laboratory rat (Rattus norvegicus) with a description of the general characteristics of this animal, while the BioSamples will be metadata for each individual biomaterial sampling at different time intervals after exposure to medicinal substances. The introduction of a bio-object retains compatibility with the metadata of the INSDC databases, since during import this level can be restored based on the metadata of the BioSample.

Fig. 2. Hierarchy of the NGID-model metadata. The levels of metadata that are directly related to genetic data are highlighted in gray.

Thus, each level of the metadata hierarchy addresses a separate user task. A BioProject is used to search for studies by topic and objectives and to structure and link objects of the lower levels. A bio-object is used to store metadata on a specific research object, with the ability to search by attributes and to structure the samples obtained from this object in different ways and/or under different conditions. A BioSample is used to store metadata on a specific sample of the object of study, including all the details of sampling and of effects on the organism, taking into account possible dynamic changes.
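The three-level hierarchy can be sketched as nested records. This is only an illustration: the class and field names, and the `individual` and `taxon` attributes used to restore the bio-object level from BioSample metadata on import, are assumptions rather than the NGID schema.

```python
from dataclasses import dataclass, field


@dataclass
class BioSample:
    sample_id: str
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class BioObject:
    object_id: str
    taxon: str
    samples: list[BioSample] = field(default_factory=list)


@dataclass
class BioProject:
    project_id: str
    title: str
    objects: list[BioObject] = field(default_factory=list)

    def all_samples(self) -> list[BioSample]:
        """Flatten the hierarchy: every sample of every bio-object in the project."""
        return [sample for obj in self.objects for sample in obj.samples]


def restore_objects(samples: list[BioSample], key: str = "individual") -> list[BioObject]:
    """Group imported samples into bio-objects by a shared metadata attribute.

    Samples without the attribute each become a bio-object of their own.
    """
    grouped: dict[str, BioObject] = {}
    for sample in samples:
        object_id = sample.attributes.get(key, sample.sample_id)
        taxon = sample.attributes.get("taxon", "unknown")
        grouped.setdefault(object_id, BioObject(object_id, taxon)).samples.append(sample)
    return list(grouped.values())
```

For instance, two samples taken from the same rat at different times after exposure would share one `individual` value and end up under a single bio-object.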

One of the most important metadata attributes is the taxonomic identifier. The NCBI Taxonomy Database, a curated database organized in the form of a directed graph that systematizes information about all domains of living organisms, was taken as the basis of taxonomy at the NGID [10]. At the beginning of 2023, the NCBI Taxonomy Database included information on more than 100 000 genera and 2 million species. The database changes dynamically, both through replenishment (in 2022, 80 thousand new species and almost 4 thousand new genera were added) and through adaptation to changes in the main taxonomic codes [11]. The taxonomy includes seven main ranks: domain or superkingdom (eukaryotes, archaea, bacteria, viruses), phylum or type, class, order, family, genus, and species, as well as intermediate ones, which allow the more accurate classification of organisms. Note that the NCBI Taxonomy Database is not the only taxonomy used in bioinformatic databases: for example, the taxonomic classifications of the Silva-LTP (All-Species Living Tree Project) and of the GTDB (Genome Taxonomy Database), which classifies prokaryotes based on their genomic sequences and the phylogeny of marker genes, are also very popular [12, 13]. The NGID provides support for several variants of taxonomic classification with the possibility of their further mutual integration.
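A taxonomy organized as a directed graph can be traversed by following parent links from a node to the root. The sketch below uses a toy table covering only the seven main ranks; the numeric identifiers are illustrative and are not real NCBI taxon IDs.

```python
from typing import NamedTuple, Optional


class Taxon(NamedTuple):
    name: str
    rank: str
    parent: Optional[int]  # None marks the root of this toy taxonomy


# Illustrative identifiers only; not real NCBI Taxonomy IDs.
TAXONOMY = {
    1: Taxon("Eukaryota", "domain", None),
    2: Taxon("Chordata", "phylum", 1),
    3: Taxon("Mammalia", "class", 2),
    4: Taxon("Rodentia", "order", 3),
    5: Taxon("Muridae", "family", 4),
    6: Taxon("Rattus", "genus", 5),
    7: Taxon("Rattus norvegicus", "species", 6),
}


def lineage(taxon_id: int) -> list[Taxon]:
    """Follow parent links up to the root and return the path root-first."""
    path = []
    current: Optional[int] = taxon_id
    while current is not None:
        taxon = TAXONOMY[current]
        path.append(taxon)
        current = taxon.parent
    return path[::-1]
```

Supporting an alternative classification such as GTDB could then amount to keeping a second table of the same shape, although how the NGID integrates several taxonomies is not specified here.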

2 SOFTWARE AND INTERFACE

Software Solution Architecture

The software architecture provides for the following tasks:

(i) high-performance processing of genetic information using supercomputer resources of the National Research Center “Kurchatov Institute”;

(ii) loading of genetic information, search (by metadata, homology, BLAST and its variants) and access to information;

(iii) taxonomic and phenotypic classification of BioSamples;

(iv) genomic analysis;

(v) joint work with BioProjects, including a communications environment, i.e., a single information space for developers and researchers.

Data analysis is carried out using a heterogeneous infrastructure that includes a cloud platform and a high-performance computing system. The computing capabilities of the NGID will be provided to researchers for collective use within available quotas. Analytical pipelines at the NGID are assembled from software processors. Users can develop their own processor and submit it for use in the NGID, describing it in accordance with the specified rules. The application programming interface (API) of the NGID will support a wide range of application software packages and platforms for genomic analysis. Containers package application programs together with the runtime environment that each one requires, which allows data-analysis pipelines to be organized on both supercomputer and cloud resources. The NGID architecture is shown in Fig. 3.

Fig. 3. The structure of the NGID model.

The system also integrates with popular bioinformatics platforms, which allows custom scenarios to be launched seamlessly on one of the platforms from the web portal or through a public API. Support for these platforms makes a large set of ready-made tools and scenarios available. For this interaction, the BioControl service has been developed; it provides a unified interface to the bioinformatic platforms, including the automatic creation of a system environment for running the requested scenarios, as well as the monitoring and management of computing tasks.

Users will interact with the NGID through a single information portal (web portal), as well as using a software interface. For sequencing centers with a large amount of sequencing data, it is advisable to create dedicated communication channels.

Interface

The main sections of the NGID web portal are:

(i) a data library containing deposited data, annotation, metadata, reverse data;

(ii) services of a user’s personal account;

(iii) an administration console, which includes tools for managing the NGID, and the monitoring and diagnostics of information and computing infrastructure.

The main services available to users after authorization include:

(i) deposition and validation of genetic data;

(ii) analysis of genetic data, including format conversion and pre-processing;

(iii) search and visualization of genetic data, including a genome browser and 3D-visualization tools;

(iv) monitoring data quality, providing means for the work of data curators;

(v) a notification system that sends messages via communication tools.

3 DATA DEPOSITION

The data-deposition process involves the entry of metadata, the loading of genetic data, and their validation. Metadata is either created anew or selected from metadata levels that already exist. The NGID defines mandatory metadata fields that depend on the metadata level and the type of BioSample; groups of fields can also be defined in which not every field, but at least one, must be filled in. User input in metadata fields is validated against the field type (number, string, date, etc.) and against the fill-in instructions, if any (for example, limiting the sample age to between 0 and 200 years). One of the key metadata fields is the taxon, which is chosen by searching for a node in the taxonomy available in the system. After the entered metadata passes the automatic correctness checks, the user can upload genetic data in the following ways:

(i) through the web interface of the NGID portal;

(ii) using client software (e.g., ftp/webdav clients, rsync);

(iii) using the API.
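The automatic metadata checks that precede the upload (field type, mandatory fields, value limits, and groups with at least one filled field) might look roughly as follows. The schema, the field names, and the `validate` function are hypothetical and only illustrate the kinds of rules described above.

```python
from datetime import date

# Hypothetical schema: field -> (accepted types, mandatory?, (min, max) or None).
SCHEMA = {
    "taxon_id": (int, True, None),
    "collection_date": (date, True, None),
    "age_years": ((int, float), False, (0, 200)),
}
# Hypothetical group of fields in which at least one must be filled in.
AT_LEAST_ONE_OF = [("tissue", "cell_line")]


def validate(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    errors = []
    for name, (types, mandatory, bounds) in SCHEMA.items():
        value = record.get(name)
        if value is None:
            if mandatory:
                errors.append(f"{name}: mandatory field is missing")
        elif not isinstance(value, types):
            errors.append(f"{name}: wrong type {type(value).__name__}")
        elif bounds is not None and not bounds[0] <= value <= bounds[1]:
            errors.append(f"{name}: {value} is outside {bounds[0]}..{bounds[1]}")
    for group in AT_LEAST_ONE_OF:
        if not any(record.get(name) for name in group):
            errors.append(f"fill in at least one of: {', '.join(group)}")
    return errors
```

Collecting all problems instead of stopping at the first one lets the depositor correct the whole form in a single pass.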

NGID plans to support the following types and formats of data:

(i) sequencing data, including high-throughput data (ab1, fastq, h5, fast5);

(ii) sequencing data mapped to the reference genome (BAM, SAM, CRAM);

(iii) nucleotide and amino-acid sequences (fasta);

(iv) annotated nucleotide and amino-acid sequences (gbk, gff, gtf);

(v) structures of biomolecules;

(vi) data on structural variations in the genome (vcf);

(vii) track data (bed, bigwig, bedgraph);

(viii) genomic and transcriptomic profiling data using chips (microarray);

(ix) 3D molecular structure data, including proteins and nucleic acids (PDB, PDBx, mmCIF).

The NGID implements batch uploading of genetic data, which automates uploading and allows large files or large numbers of files to be uploaded. A batch upload is driven by a text file that lists the files and the necessary metadata. The NGID has also developed protocols for exchanging genetic data and metadata with other databases of genetic information.
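A batch-upload description of this kind could be read as shown in the sketch below. It assumes a tab-separated layout with a header row; the column names `file_name`, `sample_id`, and `format` are hypothetical, since the actual manifest structure is not given here.

```python
import csv
import io
from typing import Optional

# Hypothetical column names; the real manifest layout is not specified in the text.
REQUIRED_COLUMNS = {"file_name", "sample_id", "format"}


def read_manifest(text: str) -> list[dict[str, Optional[str]]]:
    """Parse a tab-separated manifest and reject rows that lack a file name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"manifest lacks columns: {sorted(missing)}")
    rows = []
    # Line 1 is the header, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        if not row["file_name"]:
            raise ValueError(f"line {line_number}: empty file_name")
        rows.append(row)
    return rows
```

Each parsed row can then be matched against the files placed in the working area of the storage before validation starts.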

4 ANALYTICAL FUNCTIONALITY

Search for Homologous Sequences

The search for homologous sequences is implemented using programs for the pairwise alignment of both nucleotide (DNA and RNA) and amino-acid (protein) sequences and includes several modes of operation:

(i) searching a nucleotide database with a nucleotide query;

(ii) searching a protein database with an amino-acid query;

(iii) searching a protein database with a nucleotide query, where the alignment is based on the six translation variants of the query;

(iv) searching a nucleotide database with an amino-acid query, where the alignment is based on the six translation variants of the reference data;

(v) searching a nucleotide database with a nucleotide query, where the alignment is based on the 36 combinations of translation variants of the query and the reference data.
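The six translation variants mentioned above are the three reading frames of a nucleotide sequence on each of its two strands; translating both the query and the reference gives 6 × 6 = 36 combinations. A minimal sketch of six-frame translation with the standard genetic code follows; the function names are illustrative and the code is not taken from any alignment program.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    return sequence.translate(COMPLEMENT)[::-1]


def translate(sequence: str) -> str:
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(sequence[i:i + 3], "X")
        for i in range(0, len(sequence) - 2, 3)
    )


def six_frames(sequence: str) -> dict[str, str]:
    """Return the translations for frames +1..+3 and -1..-3."""
    sequence = sequence.upper().replace("U", "T")  # accept RNA input as well
    frames = {}
    for strand, template in (("+", sequence), ("-", reverse_complement(sequence))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames
```

For the sequence `ATGGCCTAA`, frame +1 reads `MA*` (methionine, alanine, stop), while the other five frames give unrelated peptides, which is why a translated search has to consider all of them.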

Setting a query and specifying the parameters by the user includes three steps:

(i) forming the query, either by pasting the desired sequence into a dedicated interface window or by uploading a file;

(ii) selecting a database according to the user's task and the sequence type;

(iii) specifying the parameters, which is an optional step, since default parameters suitable for typical tasks are preset.

The search-results page is a summary table of results that includes basic alignment characteristics, a graphical display of alignments, and an alignment preview. From the search-results window, you can switch to the corresponding entry in the NGID. For the convenience of users, the function of downloading a sequence or sequences directly from the output of the homology search tool can be implemented.

Genomic Browser

The user can visualize genomic data in a genomic browser that supports annotations in standard bioinformatics formats, including those supported by NGID processors. The genomic browser is focused on studying the structure of the genome, selecting certain genomic regions and working with them, including individual sequences marked in these regions. The second application for the user is visual inspection of the structure of a gene, including the location of the domains, the presence of specific subsequences, and the secondary and tertiary structure of the gene. The genomic browser provides the user with access to both public data and restricted data deposited by that user.

Typical Data-Processing Scenarios

The validation of data during deposition ensures their integrity, compliance with the declared bioinformatic format, and the possibility of further processing and use by NGID users. Data is deposited into the NGID in various formats, some of which have specific features. For example, parity validation is used to check paired reads in the FASTQ format: such data is usually presented as a pair of files that must contain the same number of records. Preliminary processing by the user may break the correspondence between the records in the two files, which leads to errors or makes further processing impossible.
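A check of this kind for paired FASTQ files can be sketched as follows. The sketch assumes plain four-line records (no wrapped sequence lines) and read names that either coincide or differ only in a trailing `/1` or `/2`; the function names are illustrative and this is not the NGID validator.

```python
import io
from itertools import zip_longest
from typing import Iterator, TextIO


def read_names(handle: TextIO) -> Iterator[str]:
    """Yield read names from four-line FASTQ records, dropping a /1 or /2 suffix."""
    for index, line in enumerate(handle):
        if index % 4 != 0:
            continue  # only the first line of each record carries the name
        if not line.startswith("@"):
            raise ValueError(f"line {index + 1}: expected a header starting with '@'")
        fields = line[1:].split()
        name = fields[0] if fields else ""
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def check_pairing(forward: TextIO, reverse: TextIO) -> int:
    """Return the number of read pairs, or raise ValueError if the files disagree."""
    pairs = 0
    for left, right in zip_longest(read_names(forward), read_names(reverse)):
        if left is None or right is None:
            raise ValueError("files contain different numbers of records")
        if left != right:
            raise ValueError(f"record {pairs + 1}: {left!r} does not match {right!r}")
        pairs += 1
    return pairs
```

Comparing names as well as counts catches the case where filtering removed different reads from each file but left the totals equal.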

Typical scenarios for processing genetic data can be implemented both by individual processors and by software pipelines. Typical scenarios performed by elementary processors include:

(i) obtaining quantitative statistical data on genomic assemblies: total length, number and median length of contigs, G+C composition;

(ii) quality assessment and statistical characteristics of sequencing data: number of reads, their average length and its standard deviation, distribution of quality values and G+C composition, determination of the share of service sequences;

(iii) taxonomic classification of reads and genomic assemblies;

(iv) assessment of the completeness and contamination of genomic assemblies of prokaryotes and eukaryotes based on the search for and counting of marker genes;

(v) mapping of reads to the reference genome;

(vi) quick search for homologous genes;

(vii) search for sequences corresponding to hidden Markov models and covariance models.
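As an illustration of the simplest of these elementary processors, the sketch below computes the assembly statistics from item (i) for a FASTA file: total length, number and median length of contigs, and G+C composition. The function names are hypothetical, and the G+C fraction is taken over the full contig length, including any ambiguous bases.

```python
import io
from statistics import median
from typing import TextIO


def read_fasta(handle: TextIO) -> dict[str, str]:
    """Read a FASTA file into a mapping from contig name to upper-case sequence."""
    parts: dict[str, list[str]] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else f"contig_{len(parts) + 1}"
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {contig: "".join(chunks) for contig, chunks in parts.items()}


def assembly_stats(contigs: dict[str, str]) -> dict[str, float]:
    """Total length, contig count, median contig length, and G+C fraction."""
    lengths = [len(sequence) for sequence in contigs.values()]
    total = sum(lengths)
    if total == 0:
        raise ValueError("assembly contains no sequence")
    gc = sum(seq.count("G") + seq.count("C") for seq in contigs.values())
    return {
        "total_length": total,
        "contigs": len(lengths),
        "median_length": median(lengths),
        "gc_fraction": gc / total,
    }
```

A processor like this needs no external databases, which is what distinguishes it from the composite data-processing tools discussed next.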

Data-processing tools, which are a combination of elementary processors and, if necessary, databases, can perform the following tasks, for example: the assembly of genomes with the preprocessing of reads, prediction of protein-coding genes, annotation of genomes of prokaryotes and lower eukaryotes, and the analysis of full metagenomic data. A significant number of software pipelines with a modular architecture will also be available, which will make it possible to carry out: the search for mutations in somatic and germ cells based on whole genome or targeted sequencing; the analysis of transcriptome sequencing data with gene and isoform counts and extensive quality control; DNA methylation analysis based on bisulfite sequencing data; analysis of amplicon sequencing data for metagenomic studies; analysis of ancient DNA with high reproducibility; the assembly of metagenomes and reconstruction of genomes of nonculturable organisms; assembly and annotation of bacterial genomes; phylogenetic analysis of bacterial genomes using sequencing data.

For the NGID, software pipelines have been developed and tested for the following tasks: genotyping by means of whole-genome or amplicon sequencing; quality analysis and processing of sequencing data; assembly and annotation of a prokaryotic genome; phylogenetic analysis with multiple sequence alignment followed by automatic alignment processing.

The capabilities of genetic-data analysis offered to users are suitable for use both in an extremely wide range of tasks (biotechnology, biomedicine, molecular evolutionary biology, agriculture, animal husbandry, microbiology, ecology, etc.), and at various stages of analysis, i.e., from preliminary data processing to obtaining final results.

5 NGID MODEL

In order to test the technical solutions adopted in the course of the technical design of the NGID, a model of the system was created at the National Research Center “Kurchatov Institute.” The NGID model is a software and hardware system deployed using available computing resources of the National Research Center “Kurchatov Institute” and includes both newly developed software and adopted open-source software. The NGID model allows up to 20 users to work simultaneously; a storage facility with a total useful volume of 1 PB has been organized.

The application software complex implements the following functionality:

(i) storage of genetic data. The supported formats are .fasta and .fastq. The possibility of the batch uploading of data is implemented by preloading a description in the .tsv format and the data files themselves into the working area of the storage;

(ii) validation of genetic data. Data formats are automatically checked and nucleotide sequences are annotated;

(iii) storage and organization of access, taking into account the differentiation of access rights to genetic data. The object-oriented data model includes the following structure: BioProject → bio-object → BioSample;

(iv) the search for genetic data both on the basis of the specified description (metadata) and on the basis of the similarity of nucleotide sequences;

(v) work with data from external sources while maintaining cross links between data and indexing for quick searching;

(vi) data analysis using bioinformatic pipelines. Integration with popular platforms that provide a wide range of ready-made scenarios is implemented;

(vii) work with the system is possible in a secure network through the developed thin client: a web portal.

Pilot operation confirmed the correctness of the technical solutions incorporated into the system architecture and the possibility of scaling the computing platform and data-storage infrastructure. Currently, the NGID is available at the National Research Center “Kurchatov Institute,” and external users can be given remote access by agreement with the developers. Proposals for improving the functional characteristics and the convenience of work, based on the experience of using the NGID model, will be taken into account when creating the industrial version of the system.

CONCLUSIONS

The system is currently available for testing in the pilot (experimental) operation mode. Amendments to Federal Law no. 643-FZ of December 29, 2022 “On State Regulation in the Field of Genetic Engineering Activities,” according to which data obtained in the course of genetic-engineering activities and molecular genetic analysis must be deposited in the NGID, come into force on September 1, 2024. At the same time, no later than December 31, 2025, genetic information obtained by state corporations, companies, budgetary and municipal institutions, as well as business entities, in the shareholder capital of which the share of the Russian Federation, constituent entities of the Russian Federation, and municipalities is at least 50%, must be deposited in the NGID [2].

Work on the creation of the NGID is being carried out in pursuance of the List of Instructions of the President of the Russian Federation following the meeting on the development of genetic technologies in the Russian Federation of May 14, 2020 (no. Pr-920 of June 4, 2020).