Background

Owing to rapid growth in sequencing throughput, falling costs, improved sequencing chemistry, and the ability to multiplex several samples/patients in a single sequencing experiment, Next Generation Sequencing (NGS) has become a powerful and efficient tool for identifying disease-causing variants and for decoding a number of genetically heterogeneous diseases, including cancer [1]. While NGS technologies have been used to identify variants in several patients in a cost- and time-effective manner, Sanger sequencing has been used as a complementary method to narrow down and confirm the NGS-detected variants before clinical decisions are made [2–4]. To identify potential disease-causing mutations accurately, it is essential to use only high-quality reads. Therefore, integrated software tools are required that can eliminate platform-specific sequencing errors as well as low-quality reads, and perform format conversion, quality trimming and filtering.

Moreover, they should be able to analyze data from several samples/patients generated on both Sanger and NGS platforms in a single run, and provide execution flexibility through customizable, requirement-based parameters [5].

At present, several solutions are available for NGS data quality control, such as NGS QC Toolkit [6], FastQC [7], PRINSEQ [8], TagDust [9], FASTX-Toolkit [10], SolexaQA [11], TagCleaner [12], CANGS [13], ngs_backbone [14], Galaxy [15], SIMPLEX [16] and QC-Chain [17]. Many of these tools work only for a particular NGS platform or are limited in their functionality (for example, by specific input format requirements), and none supports Sanger sequencing format conversion, quality control, trimming and base calling. Therefore, a one-stop, integrated and easy-to-use software tool for analyzing Sanger as well as NGS data is needed, one that offers easy handling of input and output data and supports the analysis of multiple samples/patients in a single run.

We have developed ClinQC, a flexible, integrated and easy-to-use solution for sequencing data processing, format conversion and quality control for Sanger and three NGS platforms: Illumina, 454 and Ion Torrent. We anticipate that this tool will be extremely useful for initial file processing, quality control and format conversion in sequencing-based clinical and genomic research studies, for both expert and non-expert users.

Implementation

ClinQC is developed in Python 2.7.9 (http://www.python.org) and makes use of Python's multiprocessing capability. It relies on four other tools: FastQC [7], PRINSEQ [8], AlienTrimmer [18] and TraceTuner [19]. The ClinQC workflow is depicted in Fig. 1 and consists of several sequential steps that lead from the raw sequencing reads to a high-quality, Sanger-encoded FASTQ file for each patient/sample. All parameter settings can be specified in a single configuration file (Additional files 1 and 2). To optimize performance, ClinQC makes the best possible use of the available hardware (physical memory and CPU). A buffered file read/write concept was implemented in which input and output are partially held in memory during the analysis, which reduces both the computation time and the disk read/write workload.

Fig. 1

The workflow of the ClinQC pipeline. ClinQC can be run with a single command. The flow of analysis is depicted from top to bottom. The BASE CALLING step (violet) applies only to Sanger data analysis; the DEMULTIPLEXING and DUPLICATE & CONTAMINATION FILTERING steps (yellow) apply only to NGS data analysis; all other steps (green) apply to both analysis flows. ClinQC generates three final outputs

Results and discussion

ClinQC is an open-source, easy-to-use and integrated tool that facilitates the analysis of Sanger and NGS data on a single platform with a common input/output model. It supports the rapid analysis of data from hundreds of samples/patients in parallel. All parameters can be customized in the “ClinQCOptions” file to handle sequencing platform-specific errors, and guidelines for the analysis are provided. All components of the ClinQC workflow and their inputs are summarized in Fig. 1.

ClinQC pipeline

The ClinQC pipeline (Fig. 1) consists of nine sequential steps that start with raw sequencing reads and end with three outputs: 1) a QC summary table, 2) FASTQ files with high-quality reads and 3) a QC report. Each step is described in detail below:

  1. Base calling

Because of unclear signals in Sanger electropherogram files, the base caller of the sequencer always calls an ambiguous nucleotide as N. However, more specific ambiguity codes could be assigned, e.g., R if the signal is unclear between A and G, or Y if it is unclear between C and T. Therefore, ClinQC uses TraceTuner [19] to improve the base calling and assign more specific ambiguous nucleotides.
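The mapping from an unclear two-base signal to a specific ambiguity code follows the standard IUPAC nucleotide notation. The following is a minimal Python 3 sketch of that lookup only; it does not reproduce TraceTuner's base-calling algorithm, and the function name `ambiguity_code` is hypothetical.

```python
# IUPAC codes for two-base ambiguities (standard nucleotide notation).
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",  # purine
    frozenset("CT"): "Y",  # pyrimidine
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def ambiguity_code(candidates):
    """Return a symbol for the candidate bases at one position (hypothetical helper)."""
    bases = frozenset(b.upper() for b in candidates)
    if len(bases) == 1:
        return next(iter(bases))
    # Three- and four-base combinations are not modelled in this sketch,
    # so anything that is not a known two-base pair falls back to N.
    return IUPAC_TWO_BASE.get(bases, "N")


print(ambiguity_code("AG"))   # R
print(ambiguity_code("TC"))   # Y
print(ambiguity_code("ACG"))  # N
```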

  2. Format conversion

In this step, ClinQC checks the raw sequencing files and their formats and, if needed, converts them from the native file format to FASTQ with Sanger quality encoding (Fig. 2). Sanger sequencing files are accepted in AB1 and SCF format, and NGS files are accepted in SFF, FASTA-QUAL and FASTQ format.

Fig. 2

The format conversion workflow of ClinQC. ClinQC takes raw reads in the native file format of their sequencing platform and returns unified FASTQ files with Sanger (PHRED) quality encoding
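As an illustration of what Sanger quality encoding means in practice, the Python 3 sketch below builds a FASTQ record from a sequence and its integer Phred scores (as found in a FASTA-QUAL pair) and re-encodes a Phred+64 quality string as Phred+33. It is not ClinQC's converter; the function names are hypothetical, and the Phred+64 case is included on the assumption that some input FASTQ files use that older offset.

```python
def to_sanger_fastq(read_id, sequence, phred_scores):
    """Build one FASTQ record with Sanger (Phred+33) quality encoding."""
    if len(sequence) != len(phred_scores):
        raise ValueError("sequence and quality lengths differ")
    # Sanger encoding stores each Phred score Q as the ASCII character chr(Q + 33).
    quality = "".join(chr(q + 33) for q in phred_scores)
    return "@{}\n{}\n+\n{}\n".format(read_id, sequence, quality)


def phred64_to_phred33(quality_string):
    """Re-encode a Phred+64 quality string as Phred+33 (assumed input offset: 64)."""
    return "".join(chr(ord(c) - 64 + 33) for c in quality_string)


record = to_sanger_fastq("read1", "ACGT", [40, 30, 20, 2])
print(record, end="")
```

Both helpers work on one read at a time, so they can be applied while streaming through a file rather than loading it whole.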

  3. Demultiplexing

This step applies only to NGS data in which multiple samples are sequenced in a single run using multiplexing. Based on the barcode sequences (MID: Multiplexed Identifier) provided in the “ClinQCTargetFile” file (as shown in Additional file 3), one FASTQ file per barcode is created. In the case of paired-end sequencing, two FASTQ files (one for forward and one for reverse reads) are generated. This step is skipped if the input data is already demultiplexed.
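Barcode-based demultiplexing can be sketched as follows in Python 3. This is an illustration under stated assumptions rather than ClinQC's implementation: barcodes are matched exactly at the 5' end of the read, the matched barcode is removed, and the function name and sample IDs are hypothetical.

```python
def demultiplex(reads, barcodes):
    """Group (read_id, sequence, quality) tuples by their 5' barcode.

    barcodes maps barcode sequence -> sample ID. Exact prefix matching and
    barcode removal are assumptions of this sketch.
    """
    by_sample = {sample: [] for sample in barcodes.values()}
    unassigned = []
    for read_id, seq, qual in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                n = len(barcode)
                by_sample[sample].append((read_id, seq[n:], qual[n:]))
                break
        else:
            # No barcode matched: keep the read aside instead of guessing.
            unassigned.append((read_id, seq, qual))
    return by_sample, unassigned


reads = [
    ("r1", "ACGTTTGA", "IIIIIIII"),
    ("r2", "TGCAGGCC", "IIIIIIII"),
    ("r3", "GGGGAAAA", "IIIIIIII"),
]
by_sample, unassigned = demultiplex(reads, {"ACGT": "P1", "TGCA": "P2"})
```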

  4. Adapter and primer trimming

In this step, ClinQC trims the forward and reverse adapter and primer sequences provided in the “PrimerAdapter” file (as shown in Additional file 4) using AlienTrimmer [18]. AlienTrimmer is a flexible and sensitive sequence trimmer with mismatch tolerance, which allows the number of mismatches and the k-mer length to be customized according to the data quality and user requirements.

  5. Duplicate and contamination filtering

PCR duplicates are a well-known problem that arises when low-abundance fragments are over-amplified during library preparation. These duplicates can substantially inflate allele frequencies, leading to incorrect mutation detection, and can cause unexpected species richness in metagenomic analyses [20]. Therefore, ClinQC identifies and removes duplicates using PRINSEQ [8] to eliminate this technical artifact. Contamination, which occurs when DNA from unknown sources is sequenced, is another problem, particularly in metagenomic analyses [21], and leads to incorrect results. Hence, ClinQC also uses PRINSEQ [8] to assess and eliminate contamination from the samples.
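The simplest form of duplicate removal, dropping reads whose sequence has already been seen, can be sketched in Python 3 as below. This is only an illustration of the idea; PRINSEQ's own duplicate definitions and implementation are not reproduced here, and the function name is hypothetical.

```python
def remove_exact_duplicates(reads):
    """Keep the first occurrence of each sequence; drop later identical copies."""
    seen = set()
    unique = []
    for read_id, seq, qual in reads:
        if seq in seen:
            continue  # identical sequence already kept once
        seen.add(seq)
        unique.append((read_id, seq, qual))
    return unique


reads = [
    ("r1", "ACGTACGT", "IIIIIIII"),
    ("r2", "ACGTACGT", "IIIIIIII"),
    ("r3", "TTTTCCCC", "IIIIIIII"),
]
unique_reads = remove_exact_duplicates(reads)
```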

  6. Quality trimming

Because sequencing errors in NGS short reads increase with the position in the read [22], ClinQC trims low-quality stretches and Ns from the 5’ and 3’ ends of the reads.
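End trimming of this kind can be sketched in Python 3 as follows. The threshold `min_q=20` and the function name are illustrative assumptions, not ClinQC defaults, and the sketch trims base by base rather than using any windowed scheme.

```python
def trim_ends(seq, quals, min_q=20):
    """Trim low-quality bases and Ns from the 5' and 3' ends of one read.

    quals is a list of integer Phred scores, one per base.
    """
    def bad(i):
        return quals[i] < min_q or seq[i] in "Nn"

    start, end = 0, len(seq)
    while start < end and bad(start):      # walk in from the 5' end
        start += 1
    while end > start and bad(end - 1):    # walk in from the 3' end
        end -= 1
    return seq[start:end], quals[start:end]


print(trim_ends("NACGTN", [2, 30, 30, 30, 30, 2]))  # ('ACGT', [30, 30, 30, 30])
```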

  7. Read filtering

In this step, ClinQC eliminates reads that do not meet the minimum average base quality or that fall outside the minimum and maximum read length thresholds. Thus, only high-quality reads that fulfill all quality trimming and filtering criteria are kept in the final output file.
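The filtering rule can be expressed as a small predicate. In this Python 3 sketch the threshold values and the function name are hypothetical placeholders; the actual values would come from the options file.

```python
def passes_filter(quals, min_mean_q=20, min_len=50, max_len=1000):
    """Return True if a read meets the length and mean-quality thresholds.

    quals is a list of integer Phred scores; all defaults are illustrative.
    """
    length = len(quals)
    if length == 0 or length < min_len or length > max_len:
        return False
    return sum(quals) / length >= min_mean_q


print(passes_filter([30] * 60))  # True
print(passes_filter([30] * 10))  # False
```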

  8. GC content assessment

GC content is a crucial parameter when analyzing NGS data, as under- or over-representation of GC content can affect the downstream analysis and biological conclusions. Therefore, ClinQC reports the average GC content before and after QC in the summary table for each dataset.
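One way to compute a dataset-level GC percentage is sketched below in Python 3. It pools all reads and ignores ambiguous bases; whether ClinQC pools bases or averages per-read values is not stated, so both choices are assumptions of this sketch, as is the function name.

```python
def gc_content(sequences):
    """Return the percentage of G and C among all A, C, G and T bases."""
    gc = 0
    total = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        total += s.count("A") + s.count("C") + s.count("G") + s.count("T")
    # Ambiguous bases (N, R, Y, ...) are excluded from the denominator.
    return 100.0 * gc / total if total else 0.0


print(gc_content(["ACGT", "GGCC"]))  # 75.0
```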

  9. Output generation

In this final step, ClinQC writes three outputs: 1) a summary output file in HTML format, 2) a QC report, and 3) FASTQ files containing the reads that remain after low-quality reads are filtered out.

ClinQC input

ClinQC provides uniform input and output data models for Sanger and NGS data analysis and requires a minimum of three inputs (the Adapter-Primer file is optional):

  1. Target file: The target file contains patient, experimental and sequencing information for each patient, including the paths to the raw sequencing files (Additional file 3). The first column (Patient_ID) is mandatory and should be a unique identifier for each sample. Other patient information is optional and can be ignored for genomic data analysis.

  2. Adapter-Primer file: This input file is optional and is required only if primer and adapter sequences need to be trimmed. It is a tab-separated text file with four columns describing the feature type, id, forward sequence and reverse sequence (see Additional file 4).

  3. ClinQCOptions file: The options file contains all input parameters for the various parts of the pipeline and the paths to the third-party tools. Separate default ClinQCOptions files are provided for Sanger and NGS data analysis (Additional files 1 and 2).

  4. Sequencing reads: ClinQC supports Sanger sequencing reads in AB1 and SCF format, Illumina reads in FASTQ format, 454 reads in SFF and FASTA-QUAL format, and Ion Torrent reads in SFF and FASTQ format.
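The four-column layout of the Adapter-Primer file described in item 2 above can be read with a few lines of Python 3. This is a sketch under assumptions: the field names are hypothetical (only the column meanings are given), the file is assumed to have no header line, and the example rows are made-up values rather than the contents of Additional file 4.

```python
import csv
import io
from collections import namedtuple

# Hypothetical field names for the four documented columns.
Feature = namedtuple("Feature", "feature_type feature_id forward reverse")


def read_adapter_primer(handle):
    """Parse a four-column tab-separated Adapter-Primer file (no header assumed)."""
    features = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 4:
            raise ValueError("expected 4 columns, got {}".format(len(row)))
        features.append(Feature(*row))
    return features


# Illustrative content only.
example = io.StringIO(
    "adapter\tA1\tAGATCGGAAGAGC\tGCTCTTCCGATCT\n"
    "primer\tP1\tACGTACGTAC\tGTACGTACGT\n"
)
features = read_adapter_primer(example)
print(features[0].forward)  # AGATCGGAAGAGC
```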

ClinQC output

ClinQC produces output files in the same format for Sanger and NGS data, which makes output handling and downstream analysis more efficient. The output files are:

  1. QC summary table:

The QC summary table (Fig. 3a) consists of one line for each sample/patient including references to the two other patient specific output files (QC report and FASTQ file). The QC summary table contains experimental, patient, and sequencing information along with QC summary, number of reads and average GC content before and after quality control and filtering.

Fig. 3

ClinQC final output. (a) QC summary table generated for each run, with one row per sample/patient, which includes experimental, patient, sequencing and QC information; (b) QC report generated by FastQC before (left) and after (right) quality control for each sample/patient, linked in the summary table; (c) FASTQ files with high-quality reads for each sample/patient, linked in the summary table

  2. QC report file:

After quality trimming and filtering, an extensive and intuitive quality report is generated in HTML format using the widely used FastQC [7] tool. It produces various useful plots (e.g., read base quality, read length distribution, overrepresented sequences and sequence duplication levels) that give a detailed view of the quality of the sequencing data. ClinQC generates two QC reports for each patient/sample, one before QC (Fig. 4a) and one after QC (Fig. 4b), which can be compared directly. Both QC report HTML files are linked in the QC summary table.

Fig. 4

ClinQC quality control report generated by FastQC. (a) Per base sequence quality before quality control and (b) per base sequence quality after quality control. ClinQC generates several useful QC plots for each patient’s FASTQ file before and after quality control. This makes it possible to compare directly the improvement in data quality and the number of reads before and after quality control

  3. FASTQ file with high quality reads:

After all file preprocessing, quality filtering and trimming steps are completed, ClinQC creates a Sanger-encoded FASTQ file with high-quality reads for each patient/sample (Fig. 3c). This file can be used directly in downstream analyses (e.g., mutation screening, genome assembly and metagenomics).

Availability, installation and usage

ClinQC is an open-source pipeline and is freely accessible to all researchers and clinicians. For non-expert users, ClinQC is available as a fully configured Virtual Machine (VM) at https://sourceforge.net/p/clinqc/wiki/Virtual_Machine/, which can be run on VirtualBox (https://www.virtualbox.org/) without any further installation or configuration of ClinQC.

To use ClinQC outside of the Virtual Machine, we provide the source code of ClinQC along with pre-compiled third party tools/software separately for Linux and MacOS operating systems available from https://sourceforge.net/projects/clinqc/files/ClinQC_v1.0-linux.zip and https://sourceforge.net/projects/clinqc/files/ClinQC_v1.0-macos.zip respectively.

An extensive user manual (Additional file 5) is available from https://sourceforge.net/p/clinqc/wiki and a PDF version can be downloaded from https://sourceforge.net/projects/clinqc/files/ClinQC-Manual.pdf. The user manual includes a description of the parameter files for NGS and Sanger data, references to test datasets, instructions for installing dependencies, ClinQC usage, and a detailed step-by-step description of the ClinQC pipeline. Test datasets for Sanger, Illumina, 454 and Ion Torrent are available at https://sourceforge.net/projects/clinqc/files/test_data.zip

Flexibility and reusability

ClinQC has been designed with a focus on usability and on organizing data from multiple samples/patients with little manual work or user interaction, for expert and non-expert users alike. Users only need to prepare one Target file (Additional file 3) and one Option file (Additional files 1 and 2) for hundreds to thousands of samples/patients and can then run the whole pipeline with a single command. Since all input parameters and files are given in the Target file and the Option file, storing these files for each run preserves everything needed for later reuse and reproducibility.

An integrated pipeline

ClinQC is designed to cover a wide range of applications by supporting three NGS platforms as well as traditional Sanger sequencing trace files. In most clinical and genomic research studies, Sanger sequencing is used in combination with an NGS platform for efficient and accurate mutation screening in a cost-effective and timely manner. Thus, providing Sanger and NGS data analysis on a single unified platform helps users analyze sequencing data generated from one or more experiments.

The consistency of the paired-end relationship between forward and reverse reads is essential to many subsequent analysis steps, such as read mapping, variant calling or genome assembly. Therefore, ClinQC reads both read files simultaneously to maintain paired-end consistency. If one read of a pair is lost due to trimming or filtering, its mate is also excluded from the final datasets.
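The pairing rule can be sketched in Python 3 by iterating over both read streams in lockstep and keeping a pair only when both mates pass. This is an illustration, not ClinQC's code; it assumes that the forward and reverse inputs list mates in the same order, and `filter_pairs` and `keep` are hypothetical names.

```python
def filter_pairs(forward, reverse, keep):
    """Yield (fwd, rev) only when both mates pass keep(); otherwise drop the pair."""
    for fwd, rev in zip(forward, reverse):
        if keep(fwd) and keep(rev):
            yield fwd, rev


forward = ["ACGTACGT", "AC"]
reverse = ["TTTTGGGG", "GGGGCCCC"]
# Illustrative criterion: keep reads of at least 4 bases.
kept = list(filter_pairs(forward, reverse, lambda s: len(s) >= 4))
print(kept)  # [('ACGTACGT', 'TTTTGGGG')]
```

Because the generator consumes both inputs together, the forward and reverse outputs always contain the same number of reads in the same order.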

Scalability

ClinQC is designed to handle sequencing data ranging from single-gene to whole-genome sequencing. The software can analyze data from several patients in parallel, from both Sanger and NGS experiments. Furthermore, it employs multiprocessing to use all CPUs and process data efficiently in parallel. Our benchmark study (Tables 1 and 2) shows that the execution time scales almost linearly with the number of reads.

Table 1 Benchmark of ClinQC with Illumina paired-end data. We used 2x100bp paired-end reads in datasets ranging from 1 million to 100 million read pairs. The execution time is measured in minutes
Table 2 Benchmark of ClinQC with Sanger sequencing trace files. We used 1000 trace files in AB1 format, with read lengths between 400 and 1000 base pairs. From these 1000 files we randomly sampled multiple test datasets ranging from 10 to 1000 files. The execution time is measured in minutes

Performance evaluation

To demonstrate the performance of ClinQC, we used publicly available Illumina paired-end Whole Genome Sequencing (WGS) reads of a CEPH/UTAH female individual (HapMap ID: NA12878; ~420 million read pairs, 2x100bp) produced on an Illumina HiSeq 2000. We downloaded seven sequencing runs from NCBI SRA (SRA ID: SRP048874). After pooling the reads from the seven libraries, we randomly sampled multiple datasets ranging in size from 1 million to 100 million read pairs. As shown in Table 1, the execution time of ClinQC scales almost linearly with the number of read pairs.

We further evaluated the performance of ClinQC with Sanger sequencing trace files. We used 1000 trace files in AB1 format generated by sequencing the human gene BRCA1 in 100 patients. We randomly picked subsets of 10 to 1000 trace files; the smallest and largest subsets were processed in ~10 s and ~11 min, respectively. We ran all benchmarks on a Linux server (Ubuntu 12.04 LTS with 4 CPUs and 8 GB RAM).

ClinQC future direction

The current version of ClinQC will be extremely useful for NGS data analysis targeting whole genome sequencing, whole exome sequencing, targeted sequencing and metagenomic studies. Due to its capability to perform data analysis on Sanger data as well, it provides an integrated solution for the combined analysis of these complementary technologies. As ClinQC is an ongoing project, we will address other forthcoming quality challenges. Furthermore, we will also extend the tool to support new sequencing platforms. Currently, ClinQC is not advised for the analysis of RNA-Seq data.

Comparison with existing tools

ClinQC provides a one-stop solution for performing various quality control steps. Table 3 compares the most important features of ClinQC with those of other available tools. Many of the existing tools do not offer data organization or parallel analysis of multiple samples/patients, and none supports Sanger sequencing data analysis. Moreover, unlike other existing tools, ClinQC provides simple input options, which can be prepared as text files and allow the whole pipeline to be run without any manual intervention. ClinQC can be run on any operating system using a Virtual Machine, which is not offered by any other tool.

Table 3 Comparison of various features between ClinQC and other QC tools

Conclusions

ClinQC is an integrated, automated, flexible and user-friendly tool for quality control in clinical research. It supports three major NGS technologies (Illumina, 454 and Ion Torrent) along with Sanger sequencing. ClinQC offers full flexibility, accuracy and reproducibility. All input parameters can be customized in the “ClinQCOptions” configuration file. It is a one-stop solution that runs from raw sequence reads and trace files to high-quality FASTQ files with Sanger quality encoding. The tool can easily be integrated into any downstream analysis pipeline, e.g., for mutation screening. In summary, with ClinQC 1) Sanger and NGS data can be analyzed together, 2) all quality control parameters can be customized for different sequencing data, 3) thousands of datasets/patients/samples can be analyzed in a single run, and 4) paired-end, single-end and mixed reads generated by Illumina, 454 and Ion Torrent can be analyzed simultaneously in a single run. ClinQC improves on existing tools and software in usability, handling of multiple datasets, Sanger sequencing data analysis and a common input/output model for Sanger and NGS data analysis.

Availability and requirements

Project name: ClinQC

Project home page: https://sourceforge.net/projects/clinqc

Operating system(s): All Unix operating systems

Programming language: Python 2.7.9

Other requirements: Perl 5.12 or higher, Java 1.7 or higher

License: LGPL

Any restrictions to use by non-academics: None