Background

The advent of next-generation sequencing (NGS) technologies has revolutionized the field of genomics, enabling the analysis of large-scale genomic datasets with unprecedented accuracy and resolution. However, the sheer volume of data generated by NGS requires efficient and reliable tools for variant analysis. This analysis typically involves the identification of disease-causing variants by filtering out irrelevant variants using annotation-based filtering, a critical step in the analysis pipeline that requires an understanding of both the case's conditions and available annotations [1, 2].

Several standalone and web-based tools, such as ANNOVAR, wANNOVAR, VEP, and SnpEff, are available to annotate variants [3,4,5,6]. However, variant filtration, the subsequent step in the analysis pipeline, requires specialized, flexible, and user-friendly tools. Graphical User Interface (GUI) based tools, such as VCF.Filter, VCF-Miner, and BrowseVCF, enable users to filter on any desired annotation, while others, such as GEMINI, rely on predefined annotations that restrict the user [7,8,9,10]. Command Line Interface (CLI) based tools, such as GATK-VariantFiltration, VCFtools, BCFtools filter, and Exomiser, require advanced bioinformatics and programming skills, limiting their accessibility to a broader user base [11,12,13,14]. A comprehensive comparison is provided in Table 1.

Table 1 A qualitative comparison between the most common VCF file filtering tools

This study aimed to develop 123VCF, a user-friendly and efficient GUI-based filtering tool that enables researchers and clinicians to define filters easily through a text file. 123VCF employs a disk-streaming real-time filtering algorithm, efficiently processing variant files without the need to load them into the computer's memory.

Implementation

Effective variant filtering is a pivotal stage in Next-Generation Sequencing (NGS) data analysis, involving variant annotation and subsequent filtering based on user-defined criteria. However, traditional variant filtering tools often suffer from memory-intensive processes, especially when dealing with extensive datasets, as they load the entire input VCF file into memory before applying filters [13]. To address this challenge, we introduce 123VCF, a tool that employs a memory-efficient algorithm for variant filtering, eliminating the need to load the input VCF file into memory. This design not only speeds up processing but also allows large datasets to be handled.
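The idea of filtering without loading the file into memory can be illustrated with a minimal Java sketch. It is not the 123VCF source code: the class name StreamingVcfFilter, the method filter, and the use of a generic predicate are assumptions made for illustration. The sketch reads one line at a time, copies header lines unchanged, and writes a data line only if the predicate accepts it, so memory use does not grow with file size.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.function.Predicate;

/** Illustrative sketch only; names are hypothetical and not taken from 123VCF. */
final class StreamingVcfFilter {

    /**
     * Streams a VCF from in to out. Header lines (starting with '#') are copied
     * unchanged; a data line is written only if keepVariant accepts it.
     * Returns the number of variants written.
     */
    static long filter(Reader in, Writer out, Predicate<String> keepVariant) {
        long kept = 0;
        try {
            BufferedReader reader = new BufferedReader(in);
            BufferedWriter writer = new BufferedWriter(out);
            String line;
            // Only the current line is held in memory at any time.
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("#")) {
                    writer.write(line);
                    writer.write('\n');
                } else if (!line.isEmpty() && keepVariant.test(line)) {
                    writer.write(line);
                    writer.write('\n');
                    kept++;
                }
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept;
    }
}
```

In practice the Reader and Writer would wrap files on disk, for example those returned by java.nio.file.Files.newBufferedReader and Files.newBufferedWriter, and the predicate would encode the user-defined filters.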

123VCF is a freely available, versatile, cross-platform tool developed using Java Swing and distributed under the MIT license. It provides a graphical interface for filtering VCF files based on annotations within the "INFO" and "FORMAT" fields. Additionally, researchers can isolate de novo variants in multi-sample VCF files by specifying genotypes for each sample. To keep the tool simple and independent of third-party code, all components of 123VCF were developed entirely by the authors, resulting in a straightforward and lightweight tool.
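How per-sample genotype requirements can isolate candidate de novo variants is sketched below. This is an assumption-laden illustration rather than the tool's implementation: the class GenotypePattern, its Zygosity enum, and its methods are hypothetical, and only diploid GT values are classified. A candidate de novo variant in a trio would be requested as heterozygous in the child and homozygous reference in both parents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; names are hypothetical and not taken from 123VCF. */
final class GenotypePattern {

    enum Zygosity { HOM_REF, HET, HOM_ALT, MISSING }

    /** Classifies a diploid GT value such as "0/1" or "1|1" (phased or unphased). */
    static Zygosity classify(String gt) {
        String[] alleles = gt.split("[/|]");
        if (alleles.length != 2 || alleles[0].equals(".") || alleles[1].equals(".")) {
            return Zygosity.MISSING; // no-calls and non-diploid genotypes are not handled here
        }
        if (!alleles[0].equals(alleles[1])) {
            return Zygosity.HET;
        }
        return alleles[0].equals("0") ? Zygosity.HOM_REF : Zygosity.HOM_ALT;
    }

    /** Returns the zygosity of every sample on one VCF data line (columns 10 onward). */
    static List<Zygosity> sampleZygosities(String vcfLine) {
        String[] cols = vcfLine.split("\t");
        // Column 9 (index 8) is FORMAT; locate GT among its colon-separated keys.
        int gtIndex = Arrays.asList(cols[8].split(":")).indexOf("GT");
        List<Zygosity> result = new ArrayList<>();
        for (int i = 9; i < cols.length; i++) {
            String[] fields = cols[i].split(":");
            boolean hasGt = gtIndex >= 0 && gtIndex < fields.length;
            result.add(hasGt ? classify(fields[gtIndex]) : Zygosity.MISSING);
        }
        return result;
    }

    /** True when every sample has exactly the requested zygosity, in column order. */
    static boolean matches(String vcfLine, List<Zygosity> wanted) {
        return sampleZygosities(vcfLine).equals(wanted);
    }
}
```

For a VCF whose sample columns are ordered child, father, mother (an assumption about the input), the pattern HET, HOM_REF, HOM_REF selects candidate de novo variants.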

The filtering process begins by checking the filtering order file against the header section of the submitted VCF file. Each filter is then applied to every variant using regular-expression rules tailored to string-based and numerical filters. Only variants that meet all specified criteria, in both string matching and numerical operations, are written to the designated output file(s). The core concept of the algorithm is shown in Fig. 1. With its ease of use and powerful filtering capabilities, 123VCF is a valuable tool for researchers and bioinformaticians in diverse genomic analyses.

Fig. 1 The steps of the 123VCF algorithm
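The steps described above (a header check, then string and numerical filters combined so that a variant must satisfy all of them) can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the 123VCF code: the class InfoFilters and its methods are hypothetical, the numerical filter shown supports only an upper bound, and annotation names such as Func.refGene and AF are used purely as examples.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Illustrative sketch only; names are hypothetical and not taken from 123VCF. */
final class InfoFilters {

    /** Parses an INFO value such as "DP=35;AF=0.01;DB"; flag keys map to "". */
    static Map<String, String> parseInfo(String info) {
        Map<String, String> map = new HashMap<>();
        if (info.equals(".")) {
            return map; // "." means the INFO column is empty
        }
        for (String entry : info.split(";")) {
            int eq = entry.indexOf('=');
            if (eq < 0) {
                map.put(entry, "");
            } else {
                map.put(entry.substring(0, eq), entry.substring(eq + 1));
            }
        }
        return map;
    }

    /** String filter: the annotation must exist and contain a match for the regex. */
    static Predicate<Map<String, String>> matches(String key, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return info -> info.containsKey(key) && pattern.matcher(info.get(key)).find();
    }

    /** Numerical filter: the annotation must parse as a number no greater than max. */
    static Predicate<Map<String, String>> atMost(String key, double max) {
        return info -> {
            try {
                return info.containsKey(key) && Double.parseDouble(info.get(key)) <= max;
            } catch (NumberFormatException e) {
                return false; // non-numeric values such as "." fail in this sketch
            }
        };
    }

    /** A variant passes only if every filter accepts its INFO column (logical AND). */
    static boolean passesAll(String vcfLine, List<Predicate<Map<String, String>>> filters) {
        Map<String, String> info = parseInfo(vcfLine.split("\t")[7]);
        return filters.stream().allMatch(f -> f.test(info));
    }

    /** Header check: is the key declared in a line of the form ##INFO=<ID=key,...>? */
    static boolean declaredInHeader(List<String> headerLines, String key) {
        String prefix = "##INFO=<ID=" + key + ",";
        return headerLines.stream().anyMatch(h -> h.startsWith(prefix));
    }
}
```

Treating a missing or non-numeric value as a failed filter is a design choice of this sketch; a tool may instead let such variants pass so that unannotated variants are not silently discarded.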

123VCF offers users the flexibility to include or exclude heterozygous and homozygous variants from the sample, allowing for precise and customized filtering. The tool can generate a Tab-Separated Values (TSV) file containing all passed variants, which can be easily imported into spreadsheet-based programs for further analysis. Additionally, 123VCF can generate another TSV file specifically for variants that overlap with a user-provided BED file, allowing researchers and clinicians to identify possible compound heterozygous variants. These TSV files provide a convenient and customizable way to prioritize and analyze variants of interest. The efficiency of 123VCF was evaluated using a set of variant files and compared with that of the most similar algorithms, demonstrating its ability to handle large datasets without compromising performance. The tool's disk-streaming real-time filtering algorithm was found to be efficient, providing accurate filtering results in a short amount of time.
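Checking whether a variant overlaps a BED region requires care with coordinates, because BED intervals are 0-based with an exclusive end while VCF positions are 1-based. The following Java sketch shows one way to perform the test; the class BedRegions and its methods are hypothetical and are not taken from 123VCF, and only the variant's POS is tested rather than the full span of a multi-base allele.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names are hypothetical and not taken from 123VCF. */
final class BedRegions {

    // chromosome -> list of {start, end} in BED coordinates (0-based, end exclusive)
    private final Map<String, List<long[]>> byChrom = new HashMap<>();

    /** Adds one BED line such as "chr1\t99\t200\tGENE1"; extra columns are ignored. */
    void addBedLine(String bedLine) {
        String[] cols = bedLine.split("\t");
        long start = Long.parseLong(cols[1]);
        long end = Long.parseLong(cols[2]);
        byChrom.computeIfAbsent(cols[0], k -> new ArrayList<>()).add(new long[] {start, end});
    }

    /** True when the 1-based VCF position lies inside any region on that chromosome. */
    boolean contains(String chrom, long vcfPos) {
        long zeroBased = vcfPos - 1; // convert VCF POS to the BED coordinate system
        for (long[] region : byChrom.getOrDefault(chrom, Collections.emptyList())) {
            if (zeroBased >= region[0] && zeroBased < region[1]) {
                return true;
            }
        }
        return false;
    }
}
```

The linear scan per chromosome keeps the sketch short; for BED files with many regions, sorting the intervals and using binary search would be a more scalable choice.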

123VCF provides a robust functionality that allows users to define and apply custom filtration orders using plain text files, as outlined in the user manual. This feature offers a high level of convenience, enabling users to utilize their laboratory-specific filters repeatedly without limitations. By incorporating this feature, users can streamline their workflow and enhance reproducibility, ultimately improving the efficiency and accuracy of their analysis. Furthermore, to facilitate the use of this feature, we have provided several filtering order files along with the tool, providing users with a starting point for customizing their own filtering orders.

Results

In order to demonstrate the efficacy of 123VCF, a thorough benchmark analysis was conducted using a diverse collection of VCF files from prominent projects [10, 15,16,17]. To ensure consistency in annotations, ANNOVAR with identical databases was employed for all six VCF files [5]. The benchmark comprised VCF files with varying numbers of variants and samples, and the condensed results are presented in Table 2, providing information on variant and sample counts, annotated VCF file sizes, applied filters, and run time of 123VCF, BCFtools filter and GATK VariantFiltration in seconds.

Table 2 The benchmark results of filtering six well-known VCF files utilizing five different predefined sets of filters

Table 2 shows that 123VCF is a fast and effective filtering tool capable of processing large VCF files within seconds. The algorithm filtered variants in large VCF files precisely while maintaining performance, making it a useful variant analysis tool for researchers and clinicians. Note that 123VCF adopts a different filtering strategy from other available tools, which makes direct comparisons challenging. Nevertheless, our benchmark demonstrates that 123VCF is efficient, particularly when multiple impactful filters are employed. We compared 123VCF with the most similar tools, BCFtools filter and GATK VariantFiltration, whose runtimes are given in the rightmost columns of Table 2. All tools were run on identical uncompressed, non-indexed VCF files.

A notable factor affecting 123VCF's performance is disk I/O speed. Using solid-state drives (SSDs) can significantly enhance its efficiency. To optimize runtimes, we introduced an option to remove filtered-out variants from the output files, as organizing variants in the output files was identified as the most time-intensive operation in our algorithm. Additionally, 123VCF's ability to handle varying file sizes with little impact on performance makes it a valuable resource for researchers dealing with different scales of data in NGS data analysis.

Conclusion

In conclusion, the development of 123VCF has yielded a highly efficient VCF file filtering tool with notable advantages over existing filtering tools. The tool's versatility in allowing users to define filters based on any desired annotation, and its filtering algorithm contribute to its efficacy in genetic analysis.

Another significant advantage of 123VCF is its standalone architecture, which allows users to run the tool on a local computer without requiring an internet connection. This ensures the privacy of submitted information, making it a highly secure tool for genetic analysis.

In addition, we added a command line interface to 123VCF to make it even more user-friendly and reproducible. This will allow users to easily automate their analyses and integrate 123VCF into their existing workflows. We believe that this new feature will further increase the accessibility of 123VCF and streamline the analysis process. Our team is dedicated to providing the best possible user experience, and we are excited to continue innovating and improving the tool in the future.

Availability and requirements

Project name: 123VCF.

Project home page: https://project123vcf.sourceforge.io.

Operating system(s): Platform independent.

Programming language: Java.

Other requirements: Java 1.8.

License: MIT.

Any restrictions to use by non-academics: None.