Background

The application of genomics to precision medicine holds great promise for the implementation of tailored diagnostics, optimized patient care management, and personalized therapies in healthcare. The past decade has seen the development of technological and computational innovations to bring both DNA-sequencing methodologies and bioinformatic algorithms into routine standard-of-care for diagnostic medical genomics. While there has been a broad consensus in terms of bioinformatics best practices, quality control metrics, and community adoption of variant calling and classification standards, substantial variability remains among variant curation tools and data sharing by health care providers, clinical diagnostic laboratories, and researchers.

Clinical interpretation of genomic sequencing data requires both the standardization of variant classification guidelines, as well as consistency in the workflow and evidence considered when determining the relationship between a variant and a disease phenotype. In 2015, the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) released guidelines for the interpretation of germline genetic variants [1]. These germline variant curation guidelines have been broadly adopted by clinical genetic testing laboratories globally [2]. Additionally, the National Institutes of Health (NIH)-funded Clinical Genome Resource (ClinGen) Consortium [3] has further developed refined and standardized evaluation criteria of sequence variant pathogenicity [4,5,6,7,8,9,10,11,12,13,14,15,16]. Despite these efforts, the uniform adoption and application of these frameworks have proven challenging without robust computational infrastructure and curation software to consistently guide biocurators through these complex germline variant curation guidelines.

Here we present the ClinGen Variant Curation Interface (VCI), which is a comprehensive germline variant classification platform designed to support both individual and group classification in accordance with the ACMG/AMP germline classification guidelines. The VCI is intended to be a publicly available variant curation tool which programmatically guides users through a standard process for variant evidence classification and application of ACMG/AMP guidelines in a controlled workflow to enforce rigor and quality in germline variant classification (Fig. 1). The VCI aims to serve as a central platform for clinical variant classification that fills a gap in the learning healthcare system and facilitates the widespread adoption of standards for clinical curation.

Fig. 1
figure 1

ClinGen FDA-recognized variant curation process and VCI. Overview of the ClinGen variant curation process using the VCI, an FDA-recognized workflow. Biocurators select a variant and evaluate evidence that falls into six categories. VCI viewers may view all evidence available for any variant using the VCI. The VCI supports users in making a final pathogenicity classification keeping with the ACMG/AMP guidelines. ClinGen expert panels then disseminate their variant classifications through two community resources: the Evidence Repository (ERepo) and ClinVar

Implementation

The VCI curation platform has been developed to facilitate the Federal Drug Administration (FDA)-recognized ClinGen variant classification process, support transparent evidence review, and provide timely dissemination to the genomics community. Users can curate individually or communally in groups known as affiliations. The VCI programmatically displays relevant data types from external sources (Table 1) and displays evidence identified by other VCI users in an organized user interface enabling an environment to document ACMG/AMP criteria codes.

Table 1 VCI displayed information type and source

The core elements of the VCI data model are shown in Fig. 2. The full data model is stored in JavaScript Object Notation (JSON) format with references to data elements. The data model is centered upon a variant classification, with attributes consisting of data and context related to asserting the variant’s pathogenicity. The classification model is based on the combined variant, disease, and mode of inheritance data models. Each variant is evaluated by biocurators against specific evidence types which are reflected in the VCI’s data models (e.g., population data, experimental data, computational data), as well as literature-based evidence which can be manually added by a biocurator and related to any of the other evidence types. Biocurator-selected evaluations of the evidence criteria provide a computed variant pathogenicity, which can be manually overridden by biocurators based on expert opinion consistent with the ACMG/AMP guidelines.

Fig. 2
figure 2

Core VCI data model used for storing and retrieving data as JSON documents. The VCI uses a data model centered on a classification (dark blue center box), with relationships to other data models (white boxes). Each assertion is uniquely defined by the Variant, Disease and Mode of Inheritance models and is owned by a user or affiliation. It has two core attributes A status indicating its state in the current workflow cycle and B selected classification (Fig. 1). It uses different types of evidence (see Evidence Collection in Fig. 1) and applies an evidence criteria to arrive at the selected classification. Each evidence type can use articles as supplementary evidence. As an interpretation progresses through review statuses (such as provisional, approved, and published), snapshots of the full data at each review step are created. The relationships between data models are represented here, with 1:1 (solid green lines), 1 to many (N), or many to many (N to N) indicated

The classification model is designed to support the ClinGen workflow of variant curation, which is an iterative and manual process, with biocurators making criteria evaluations for the assessment of genetic variant pathogenicity, and expert panels considering and approving final pathogenicity classifications. The platform models this process in two ways, first by marking all curations with the most recent status, progressing from “In-progress” (interpretations where the evidence is still being evaluated), “Provisional” (interpretations completed by the primary biocurator but awaiting expert approval) and “Approved” (interpretations which have been fully reviewed and classified). Should a revision to the evidence evaluation need to be made to an “Approved” interpretation the VCI will save those changes under the status “New-Provisional,” which will require a new approval. Approved classifications by ClinGen Variant Curation Expert Panels (VCEPs) may be sent to the ClinGen Evidence Repository (ERepo; https://erepo.clinicalgenome.org/) [26]. The ERepo is intended to provide access to variant level evidence used and applied by VCEPs in the classification of variants. Upon submission to the ERepo, approved classification will have the additional “Published” status appended. Secondly, the classification model only stores references to its related models (e.g., variant, disease, evidence, evidence criteria), which also store the most recent information. When a classification has a status of “Provisional,” “Approved,” or “Published,” the snapshot model is used to store an instance (point in time) containing all the related data. This allows changes to the related data to be identified, while previous data, which may have been used to make a criteria evaluation, are still preserved in the snapshot. Snapshots published to ERepo and the ClinVar database (https://www.ncbi.nlm.nih.gov/clinvar/) [27] undergo a transformation to the ClinGen Interpretation model (http://dataexchange.clinicalgenome.org/interpretation/index.html) [28], which aligns with a related community model; the Monarch SEPIO Framework (https://github.com/monarch-initiative/SEPIO-ontology/wiki/SEPIO-Overview) [29]. The SEPIO framework was chosen as it provides an ontology-based modeling framework which supports the scientific assertions and provides a structure for the evidence and provenance supporting those assertions.

The VCI is accessed through a web browser where users can perform curation activities including review of imported evidence, entry of evidence gathered from published and unpublished sources, ACMG/AMP criteria application, pathogenicity evaluation, and classification review and approval. The classified variants from ClinGen-approved VCEPs are then shared with the ERepo and ClinVar to enable peer review and public access.

The software for the VCI is freely and openly available in perpetuity via publicly accessible web pages and two publicly available GitHub repositories, one for the 1.0 legacy code (https://github.com/ClinGen/clincoded/) [30], and one for the current 2.0 codebase (https://github.com/ClinGen/gene-and-variant-curation-tools) (Table 2). The VCI website (https://curation.clinicalgenome.org) [31] is developed as a common interface for both the VCI and the related Gene Curation Interface (GCI), which is used to evaluate the strength of evidence that variation in a particular gene causes a particular disease. These two tools, VCI and GCI, use the same platform and share components such as a user database and classification data. User access to the VCI and GCI is available by authenticated login. Login permissions are required to document the provenance of evidence added to the interfaces.

Table 2 VCI software components

Users access the VCI web application via the browser and execute the workflow tasks needed to perform variant curation. The current deployment, VCI v2.0, utilizes cloud-based web development best practices and a “serverless architecture,” which is a cloud development approach where all application resource management and scaling needs are automatically determined and handled by the cloud services. All the components essential for the application including authentication, gateways to receive and respond to browser requests, microservices, and storage are provided by Amazon Web Services (AWS). This scalable and robust architecture is based on several AWS serverless components shown in Fig. 3, the role of key components is further described here. The Application Programming Interface (API) Gateway handles tens of thousands of requests per second and provides automatic schema validation of data, ensuring data integrity in the VCI. Lambda spawns microservices to store or retrieve data, managing and scaling the computing resources required by the VCI. DynamoDB is a flexible, document-based database that provides constant load-independent performance, supporting the VCI in long-term goals of scaling to large numbers of variant classifications and providing bulk variant curation support, while Simple Storage Service (S3) stores the large VCI database. Additionally, Cognito is used for user management, and Amplify for web content integration with backend microservices. The user interfaces are created using standard JavaScript programming (ReactJS), and they obtain information from the database via an API using a standard JSON format. For comparison, this continuous integration and deployment provide greater reliability and cost-savings relative to the initial deployment of the VCI v1.0 built following a classical three-tier architecture with a web-frontend component (ReactJS), backend business logic layer (Python and Pyramid), a split persistence layer containing the state and metadata database (PostgreSQL), and search indexes (AWS Elasticsearch).

Fig. 3
figure 3

VCI platform components overview including schema and serverless architecture. The platform is web-browser based and uses AWS cloud services. An External Resources Manager retrieves population, predictive, functional, and other variant and gene-level data from external sources (Table 1) via APIs. In addition, the user can add in curated evidence, evaluate against ACMG/AMP guidelines, and save the classification for review and approval. The approved classifications are then submitted to the ClinGen Evidence Repository. All data are saved in a database via microservices and can be accessed via queries. The Amazon Cloud Services provide the microservices to store and retrieve data adopting a Serverless Architecture utilizing the following components and services: Amplify, Cognito, API Gateway, Lambda, DynamoDB, S3, and the user-facing web interface is created using React JavaScript Library

Service components include an external resource manager, which is responsible for obtaining data from external sources (Table 1) that feed into the variant, gene, disease, population, predictive, functional, and gene data models.

Finally, ClinGen provides supplemental resources for VCI users including general information about biocuration, summary videos outlining the concepts and methods behind biocuration, and links to biocuration resources such as ClinGen’s documented standard operating procedures (https://clinicalgenome.org/curation-activities/variant-pathogenicity/) [32].

Development process

The VCI software and product development teams worked alongside the ClinGen Variant Curation Interface Task Team to develop the initial platform. This product was designed through a user engagement process and VCI v1.0 was launched for use in September 2016, with new features developed and released monthly. The completely re-architected and updated VCI v2.0 platform was launched in December 2020 and is the current production version.

VCI development continues to evolve with the input of core members of the ClinGen variant curation community that meet twice per month with the VCI development team. This group includes members of the ClinGen Sequence Variant Interpretation (SVI) Working Group, which provides guidance on how to interpret, refine, and standardize the ACMG/AMP guidelines [5,6,7, 11, 16], and members of ClinGen’s VCEPs. Additional guidance for VCI development comes from the ClinGen Data Access, Protection and Confidentiality (DAPC) Working Group, which reviews tools and data practices in the ClinGen curation ecosystem to ensure that software development efforts are informed by updated data sharing policies. Detailed best practice recommendations for biocurator use of the VCI and associated resources for variant classification are provided in the ClinGen Variant Curation Standard Operating Procedure.

Variant identification and evidence

It is possible to define a variant in several different ways. This promiscuity arises because of the availability of multiple transcripts and genomic reference sequences and various ways to describe insertion/deletions. As a result, unambiguous identification of variants is critical to the downstream usability of the curated data. The VCI identifies a variant by either a ClinVar Variation ID [33] or a ClinGen Allele Registry ID [17]. ClinVar IDs are assigned to each set of submitted variants, generally resulting in a single variant being associated with a single ID, ClinVar does support two subclasses of IDs, allowing IDs for variants being directly interpreted and those being interpreted in the context of a set of variants. ClinGen’s allele registry provides a globally unique “canonical” variant identifier (CAid) on demand for variants. This enables aggregation of variant information from different sources [17]. The variant is associated with a Mondo Disease Ontology [34] term and a Human Phenotype Ontology [21] mode of inheritance term by the biocurator or VCEP. Recognizing that a variant can be curated for more than one potential disease, each user is restricted to one variant classification record per variant. Within the interfaces each variant is titled based on a hierarchy of naming conventions, preferentially using a title based on the Matched Annotation from NCBI and EMBL-EBI (MANE) Select [35] transcript when available (Fig. 4A).

Fig. 4
figure 4

The basic information view. The VCI has six-tab views that collate and display variant information from external and internal sources to biocurators. A The top title view is always viewable and shows the variant title, links to the variant in external resources, and key curation information for the record. B The criteria bar displays evaluated criteria and the calculated pathogenicity. C The basic information tab displays any curations available for the variant in the VCI and ClinVar and transcript information from RefSeq and Ensembl

The VCI aggregates and displays multiple types of evidence about a variant, separated into six tabs structured by data type, providing a rich and structured evidence gathering experience to biocurators, while supporting variant classification in accordance with the ACMG/AMP guidelines. This promotes consistency in terms of the evidence evaluated, application of the ACMG/AMP criteria, and pathogenicity calculations. In keeping with the ClinGen goal to support appropriate community data sharing, all evidence added by users is viewable by any other VCI user. While all evidence is viewable by all users, a user’s evidence evaluations and pathogenicity calculations remain private until the classification record has been finalized (set to “Approved” status), at which point other users can view, but not edit the final classification in the VCI.

Automated evidence

The VCI programmatically retrieves and displays many different types of evidence for each variant (Table 1). This includes the many possible variant nomenclatures on different transcripts and human genome builds, population frequency data, in silico prediction scores, conservation data, gene and protein resources, and all classifications and submissions for the variant currently present in ClinVar or the VCI. When evidence is unavailable for direct display via API, dynamic links to external information sources are embedded within the relevant evidence tab. Relevant external information sources are identified in conjunction with core members of the ClinGen variant curation community described above.

Manually curated evidence and structured data capture

Users can manually add information relevant to the variant being classified from published articles for any evidence type into the relevant sections in the VCI. Additionally, structured data capture is supported for published functional data and for published and unpublished case and segregation evidence. Such structured data inputs ensure curation consistency, as they organize and accurately define the information so that it can easily be retrieved, facilitate searching of the captured data, and enable downstream data processing applications such as data mining and machine learning. Two examples of structured data capture within the VCI are outlined below (Functional data capture and Case level data capture).

Functional data capture

The ACMG/AMP guidelines [1] require the assessment of well-established in vivo or in vitro functional studies showing “no damaging effect” (BS3) or “supportive of damaging effect” (PS3) on protein function or splicing. We have developed a structured framework with the narrative of (1) method, (2) material, and (3) effect (with or without a quantifier), using standardized terminology from ontologies, for users to define the functional data they have derived from published articles in a consistent and reproducible way. We provide users with a standardized template for capturing these structured data which they can then submit to ClinGen’s Functional Data Repository (FDRepo [https://ldh.genome.network/fdr/ui/]) [36]; subsequently, these granular functional data for each variant are viewable in the “Experimental” tab in the VCI. Future enhancements will include updating the structured narrative and data fields in accordance with new standards [7] and augmenting capacity for bulk annotations to be imported from literature annotations or databases of functional evidence including the increasing availability of data from multiplex assays of variant effect [37, 38].

Case level data capture

Case level and segregation level data are critical to pathogenicity evaluations using the ACMG/AMP guidelines. However, individual case observations have the potential to be linked to individual patient identities if enough ancillary information is also included about the case. It is for this reason that the case segregation tab of the VCI prompts users to remember the Terms of Use for this tool, which include prohibitions on entering protected health information (PHI) or other sensitive information that could possibly identify an individual data subject and entering only the minimum necessary information to resolve a particular case. To further protect data subjects from the possibility of re-identification (which is also strictly prohibited for users of the VCI as stated in the Terms of Use), individual case-level data are not made publicly available through ClinGen tools except in aggregate. When entering in case observations or pedigree segregation evidence, users are directed to a form that has separate fields to capture each distinct case-level observation or co-segregation event. Then, individual counts for each distinct data type are summed together and analyzed in aggregate along with the same information from other evidence sources (Fig. 5B).

Fig. 5
figure 5

Case and Segregation Evidence Capture. The structured data capture for the case/segregation view in the VCI, including the A top title view B evidence sources are captured and structured so that users can quickly see all sources and the summed individual counts from the pooled evidence for specific ACMG/AMP criteria (shown here is PM3)

Curation workflow

ClinGen variant curation through the VCI enables the use of the nomenclature, criteria codes, and rules defined in the joint 2015 ACMG/AMP guidelines on variant classification [1]. Embedded flexibility is designed to allow biocurators to incorporate modifications and additional guidance produced by ClinGen’s Sequence Variant Interpretation WG as well as disease specifications from ClinGen VCEPs following the FDA-recognized validation process. To further aid this process, the VCI allows groups of users to curate variants as a single entity, known as an “affiliation.” VCI affiliations are often ClinGen VCEPs [39, 40]; however, any group of users who wish to curate variants together (e.g., a clinical or research laboratory) may form an affiliation. Once a VCI user initiates a classification, it belongs to that individual or affiliation and can only be edited by them.

The ACMG/AMP guidelines provide a set of criteria to be considered when classifying a variant. The VCI is designed to help users evaluate the applicability of these criteria in an efficient and structured way (Fig. 1). The VCI groups criteria by evidence types: displaying both the relevant criteria and any related evidence. These groups are (1) Population (known variant allele frequencies), (2) Variant Type (predicted impact of the variant on the gene product), (3) Experimental (functional assay data), and (4) Case/Segregation (relevant observations of the variant). The VCI also groups together gene-focused resource links, and basic information, displaying ClinVar and VCI curations for the variant as well as the molecular consequence of the variant on all known transcripts (Fig. 4C and Table 1).

For each ACMG/AMP criteria, users are provided a description of the guideline in the VCI. The user can view, add, and evaluate the relevant evidence and then set their criteria evaluation and write an explanation. As some criteria are applicable at different pathogenicity strengths, users can choose the appropriate strength from a pulldown list containing only the appropriate strength options for that criterion. For instance, the available options provided for evaluating PP1 are Not Evaluated, Met, Not Met, PP1_Moderate, and PP1_Strong. As a user saves their evaluations, a criteria bar (Fig. 4B) in the header of the interface keeps track of their progress by indicating which criteria have been “Met” (solid color background with white criteria code), “Not Met” (gray background with colored criteria code) or remain “Not Evaluated” (white background with colored criteria code). If a user scrolls over individual criteria codes in this bar, they will see a description for each criterion and they can click on individual criteria codes to link to the pertinent section in the VCI. Additionally, a progress bar shows the number of criteria met according to the strength of the evaluation and whether they are “Benign” or “Pathogenic” and automatically calculates the pathogenicity each time an evaluation is saved or updated. This auto-classification is based on the default guidelines for weighing and combining the Pathogenic and Benign evaluated ACMG/AMP criteria as laid out in Richards et al. [1]. At any time, a user can view an “Evaluation Summary” that summarizes all their evaluated evidence. If a criterion code is not evaluated, then it would not be considered in the calculation of a predicted classification. Once a biocurator is satisfied, they have reviewed all pertinent evidence and evaluated all relevant criteria; they can save their classification as “Provisional”. This generates a PDF version of the “Evaluation Summary” that can be distributed among the VCEP membership domain experts to aid in their review process. Upon satisfactory completion of the review process, a final classification can be saved as “Approved,” at which point the “Evaluation Summary” can now be viewed by all VCI users.

FDA recognition and data dissemination

The VCI generates an output file of the final variant pathogenicity classification in an auto-generated format compatible with ClinVar submission specifications. This is intended to facilitate timely dissemination of variant classifications to the genomics community via ClinVar and is a requirement for ClinGen VCEPs. The ultimate goal is to support fully automated, API-based ClinVar submission through the VCI once ClinVar provides support for API-based submission. Once submitted, a “submission to ClinVar” (SCV) identifier is obtained and can be viewed in the VCI variant record.

The ClinGen variant curation process was recognized by the FDA in December 2018 [41] and is followed by all ClinGen VCEPs. The evidence curation data and pathogenicity classifications generated within the VCI by ClinGen VCEPs are therefore considered to be valid scientific evidence that can be used to streamline the test development and validation processes. As such, additional steps and requirements apply to the information specifically generated through the ClinGen VCEP variant curation and classification process. Specifically, all the evidence that has been curated and evaluated, along with provenance should be made publicly available and easily accessible. With this in mind, the VCI saves all evidence that is evaluated by its users. In addition, upon final approval of a classification from a ClinGen VCEP, the VCI facilitates data flow to the ClinGen Evidence Repository (ERepo), where the finalized classifications and associated evidence evaluation is published. Importantly, the VCEP generated variant record in the ERepo includes comments for specific codes enabling in-depth and transparent data for peer review (Fig. 1). These publicly accessible displays of the final ClinGen VCEP variant classifications are accessed via the ERepo API at https://erepo.clinicalgenome.org/evrepo/ [26].

Current status and users

VCI currently has over 1100 registered users, two thirds of whom are members of ClinGen VCEPs (Fig. 6). The VCI is publicly accessible (with registration) for variant curation. When curating together as an affiliation, all members can view and edit all information added by anyone in that affiliation. This popular feature is currently used by 79 registered affiliations, most of which represent official ClinGen VCEPs but also include other affiliations of ClinGen members (e.g., institution-specific biocurator teams) and groups unrelated to ClinGen (e.g., clinical and research laboratories).

Fig. 6
figure 6

VCI platform growth over time. A Number of curated variant classifications performed in the VCI over time. B The number of biocurators and biocurator affiliations accumulated over time are noted at the top of each bar

Results and discussion

Here we present the development of a genetic variant biocuration platform for health care providers, researchers, and the medical genetics community to determine which gene variants are causal for a disease. The VCI supports the FDA-recognized ClinGen variant curation process and combines clinical, genetic, population, and functional evidence with expert review to classify variants into ACMG/AMP 2015 variant classification guideline categories [1]. Primary features of the VCI include the ability to (1) curate individually or in groups, (2) associate pertinent evidence with variant classifications, (3) allow users to assess evidence per variant curation disease/gene-specific protocols, (4) enable users to save provisional records, (5) support an expert review process of curated evidence, and (6) automatically publish classifications and underlying criteria assessments to the ERepo

Future VCI improvements will focus on enhancing scale, workflow, throughput, and support ongoing compliance with FDA recognition of the ClinGen Variant Curation Expert Panels through the FDA Human Variant Database program. The current VCI v2.0 platform has over 6300 variant classifications in different curation stages, and our modernized architecture is able to scale to support over 1 million future classifications. We plan to enhance workflow usability and curation efficiency by making the platform more proactive with (1) task management (supporting assigning of variant classification records to users) and action items (alerts to users), (2) support for bulk variant curation workflows, (3) automatically bring in additional variant evidence data and monitor major data changes via a streaming service so curations can be updated as needed, and (4) provide customized curation experiences based on VCEP specifications. To ensure compliance with FDA requirements for the auditability of ClinGen variant curation process, the VCI database maintains a complete audit trail of all saved curation actions. We will further support FDA compliance with (1) additional traceability, (2) permanent archiving, (3) regular knowledge updating through literature and database monitoring, and (4) update alerts provided to curation teams.

Conclusions

The VCI provides needed software infrastructure and a comprehensive curation platform necessary for supporting variant classification, a critical step in the use of genomics in medicine. This global open-source platform aids individual biocurators and teams of collaborating biocurators in performing the complex task of variant curation in an efficient workflow to enforce rigor and quality in variant classification ultimately contributing to scientific advancement and informing health care management.