cFinder: definition and quantification of multiple haplotypes in a mixed sample

Niklas, Norbert; Hafenscher, Julia; Barna, Agnes; Wiesinger, Karin; Pröll, Johannes; Dreiseitl, Stephan; Preuner-Stix, Sandra; Valent, Peter; Lion, Thomas; Gabriel, Christian

doi:10.1186/s13104-015-1382-7

cFinder: definition and quantification of multiple haplotypes in a mixed sample

Technical Note
Open access
Published: 07 September 2015

Volume 8, article number 422, (2015)
Cite this article

Download PDF

You have full access to this open access article

BMC Research Notes Aims and scope Submit manuscript

cFinder: definition and quantification of multiple haplotypes in a mixed sample

Download PDF

Norbert Niklas¹,
Julia Hafenscher¹,
Agnes Barna¹,
Karin Wiesinger¹,
Johannes Pröll¹,
Stephan Dreiseitl²,
Sandra Preuner-Stix³,
Peter Valent⁴,
Thomas Lion³ &
…
Christian Gabriel¹

4175 Accesses
6 Citations
1 Altmetric
Explore all metrics

Abstract

Background

Next-generation sequencing allows for determining the genetic composition of a mixed sample. For instance, when performing resistance testing for BCR-ABL1 it is necessary to identify clones and define compound mutations; together with an exact quantification this may complement diagnosis and therapy decisions with additional information. Moreover, that applies not only to oncological issues but also determination of viral, bacterial or fungal infection. The efforts to retrieve multiple haplotypes (more than two) and proportion information from data with conventional software are difficult, cumbersome and demand multiple manual steps.

Results

Therefore, we developed a tool called cFinder that is capable of automatic detection of haplotypes and their accurate quantification within one sample. BCR-ABL1 samples containing multiple clones were used for testing and our cFinder could identify all previously found clones together with their abundance and even refine some results. Additionally, reads were simulated using GemSIM with multiple haplotypes, the detection was very close to linear (R² = 0.96). Our aim is not to deduce haploblocks over statistics, but to characterize one sample’s composition precisely. As a result the cFinder reports the connections of variants (haplotypes) with their readcount and relative occurrence (percentage). Download is available at http://sourceforge.net/projects/cfinder/.

Conclusions

Our cFinder is implemented in an efficient algorithm that can be run on a low-performance desktop computer. Furthermore, it considers paired-end information (if available) and is generally open for any current next-generation sequencing technology and alignment strategy. To our knowledge, this is the first software that enables researchers without extensive bioinformatic support to designate multiple haplotypes and how they constitute to a sample.

eALPS: Estimating Abundance Levels in Pooled Sequencing Using Available Genotyping Data

A graph-based algorithm for estimating clonal haplotypes of tumor sample from sequencing data

Article Open access 31 January 2019

GASOLINE: detecting germline and somatic structural variants from long-reads data

Article Open access 27 November 2023

Background

The onset of next-generation sequencing (NGS) platforms in many laboratories enables researchers to quantify the genetic composition of a specific sample. Assuming that each cell of this sample produces the same amount of DNA a relative quantification can be deduced from the overall amount of sequencing reads [1].

454 sequencing allows for long reads up to 1000 bp, reading through most target regions where phasing is necessary. 454’s pyrosequencing strategy reads one template DNA molecule in either forward or reverse direction [2]. Illumina can also do paired end sequencing, where every DNA fragment is read in both directions and can be matched, even if there is no overlap. It is evident that both technologies allow for the detection of haplotypes although different sequencing strategies must be applied.

However, the detection of compound mutations as in BCR-ABL1 for resistance testing and the quantification of the found clones have to be performed manually [3]. This process is not just time-consuming but also error prone. To our knowledge there is no software yet that can effectively detect and exactly quantify multiple haplotypes without extensive bioinformatics assistance. There are some tools (e.g. HaplotypeCaller of GATK) that can establish phasing but these tools provide no ability to detect more than two clones or calculate quantities. GS Amplicon Variant Analyzer (AVA, 454 Life Sciences, Branford, CT) is the only software that allows the definition of combined variants manually, with additional manual calculation steps. CLCbio’s Genomics Workbench 7.0 (GWB, CLCbio, Aarhus, Denmark) is able to link mutations in the same coding triplet only.

Ultra-deep sequencing with a coverage of multiple 10 k reads per target is necessary to yield accurate quantities [4]. Because exact phasing of whole genome or whole exome data is impossible and approximation just feasible statistically [5], we assume a rather small size (up to 10 k bp) as a region of interest in the human genome (trade-off between region size and coverage) [6].

cFinder is capable of analysing sequence data independent of any sequencing platform, organism and library preparation strategy. Prerequisites are a reference sequence and a medical or biological expert. Moreover, a graphical user interface provides convenient usage, so that our tool can be applied without bioinformatics or scripting knowledge. Our aim is not to sequence multiple samples and to deduce haploblocks (multiple variants passed on together by meiosis) over statistical occurrences in a population, but to characterize one sample and its composition precisely.

Technicalities

The software can be run on any desktop computer with.NET 4.0 installed, memory usage and runtime are both reduced to a minimum. Assuring the highest degree of flexibility, no mapping or alignment algorithm is directly implemented (see Additional file 1).

Thus, the software is able to work with the most common mapping formats ACE and SAM (unpadded and padded) [7]. An annotation file (genebank format *.gb or user defined) allows the calculation of amino acid and coding sequence changes.

The architecture of the software allows for loading an enormous amount of reads (coverage), just affecting loading time. For 4 M simulated reads in paired end mode, cFinder consumed 811 MB memory and took 11.2 min to load the data, calculation was done in 11 s; for 8 M simulated reads loading took 23.8 min and calculation 22 s; conducted with 2.67 GHz and 6 GB RAM.

The output of the software is a list of connected variants (haplotypes) occurring with the same pattern on multiple reads. Additionally the coding region change and the amino acid change are reported. Hence, it can be investigated what variants occur together. The absolute number of reads with this variant combination is displayed together with the calculated percentage. This information may be exported in csv or MS Excel format.

Usage and workflow

All tasks after the initial loading step run in linear time (target region size or coverage), the workflow is summarized in Fig. 1.

The software parses the mapping file and associates found haplotypes per read. At this step all variations (including sequencing errors) are used to create the haplotypes. Consecutive variants following each other are reported in a combined and single format, likewise are variants in one coding triplet.

Found variants are displayed and filtering is feasible. Subsequently, the user may select relevant variants. These can be designated for a specific genus of bacteria or variants that are responsible for resistance against medication [8], e.g. variant definitions from COSMIC database can be loaded.

The previously stored haplotypes are recalculated based on the selected variants and reported together with their percentage. Haplotypes as well as variants can be exported; hence, monitoring a patient at different time points is possible.

Methods

The percentage of a haplotype is calculated with

$$percentage = \frac{haplotypeHits}{{\mathop \sum \nolimits_{i}^{variants} {\text{coverage}}(variants_{i} )}},$$

the number of reads with this haplotype’s variant composition divided by the coverage at these positions. Coverage might vary over the reference sequence. Reads not covering the variant positions will be counted as wild type (wt), hence the sum of all percentages will not sum up to 100 % and wt percentage is estimated from detected clones ($1 - \mathop \sum \nolimits percentages$, minimum 0).

Only sequence reads exactly matching the variants of the assigned haplotype are counted. There is no further statistics applied to the found haplotypes, only the extraction of all occurrences in a highly efficient way. Forward and reverse fragments of paired-end reads (if available) are connected to establish a haplotype. Ambiguous bases in a paired-end overlap are not counted as a variant (by definition one of them is a sequencing error).

Overlapping amplicons

Specialized amplicon designs have to be developed when read length is shorter than region of interest (see Fig. 2). If this is not uniformly covered by design, the amplicon positions must be loaded (to perform correct calculations). For these amplicon designs the user finds a specialized checkbox “Infer Relationship(s)” where haplotypes scattered over multiple amplicons are again reconnected. The algorithm creates a symmetric, unweighted graph with nodes representing variants and edges representing an occurrence on (at least) one sequence. In this graph, the largest subgraph is determined where all nodes are connected with all other nodes in the subgraph (finding this graph is referred to as maximum clique problem in graph theory). The subgraph defines a new haplotype, and readcounts for contributing fragments are recalculated. Fragments and connections with no reads left are removed and the next largest subgraph is searched. Since the maximum clique problem is NP-complete [9], this is the only computationally expensive task after loading. It is limited to 15 rounds and 20 variants to be accomplishable on a desktop computer. The output for an overlapping amplicon design does not differ from the above described output. Concerning the connections between variants, some might be detected that cannot be found on reads. It should be noticed that this combinatorial task is complex and sophisticated and there are cases where no connection can be achieved at all, especially with high number of variants and overlapping subgraphs.

Results and discussion

Next-generation sequencing data generated from long-range BCR-ABL1 sequencing was utilized for evaluation and testing [8]. The original data consist of sequencing one 933 bp amplicon for eleven samples with 29 clones detected by manual investigation (using 454 GS FLX+ technology, AVA, CLC GWB and manual calculation of clones and percentages), ten samples were used for subsequent analysis (see also Additional file 2). Approval of an ethics committee is not required for this analysis. Alignments of individual reads from AVA and CLC GWB using the Large Gap Mapper tool (settings m2, i3, d1, s0.7, l0.7) were loaded and annotated in our cFinder, and both compared to previous (manual) results from [8]. Clinically relevant variants were selected and clone detection yielded 38 clones in 10 samples. All previously detected clones could be confirmed and clones with a percentage greater or equal to 1 % were investigated. Correlations of clones’ percentages with manual data were 0.973 for AVA as well as for CLC alignments and correlations of counted number of reads were 0.995 and 0.993 for AVA and CLC, respectively.

Since alignment of a large 540 bp deletion (well-known for the BCR-ABL fusion product) in sample 8 is not possible by AVA, it was just analyzed with the alignment created by CLC. All clones with a percentage greater or equal to 1 % were investigated.

A detailed comparison is presented in detail in Table 1. For four samples more clones were detected, representing sub clones of the manually defined ones. In one of them, this leads to reduced percentages of previously reported clones. Just five clones differed by more than 10 percentage points, although a similar number of reads was reported (different calculation).

Table 1 Comparison of alignment methods and cFinder to manual approach

Full size table

Two clones fell below the 1 % limit with the improved calculation and would not be treated as real clones. The results generated with AVA and CLC differ for one clone only.

Deviation from manual results is evident due to different definition of compound mutations and calculation ($n = 2^{numVariants}$ possible combinations, where numVariants is the number of variants of interest). Manual calculation did not take coverage at variant position into account, while AVA does not output the number of reads concerning sub clones accurately. Our software tool detects occurring haplotypes automatically, counting each read just once and calculates the frequencies adjusted to the coverage at the variants position.

No real data were available for a multi-amplicon design, therefore two artificial alignments were created (Fig. 3). Our software managed to infer the haplotypes scattered over multiple amplicons and extract their correct clones and counts (Table 2).

Table 2 Test of simulated data with overlapping amplicons

Full size table

Besides, we used GemSIM to simulate 100 k reads (l = 600) for a short (924 bp) fragment of E. coli using the profile for 454 data [10]. In total 26 test cases had 133 different haplotypes (min. 3, max. 8 haplotypes with each up to 9 variants, avg. 4.2 variants). We included multiple clones (haplotypes) that were created by the GemSIM software (GemHaps.py) along with sequencing errors. After alignment with CLC GWB the cFinder detected all haplotypes. Using the absolute number of reads detected the estimated percentages correlated with R² = 0.96 that is nearly linear. The deviation can be related to the high number of errors included in the 454 profile. Figure 4 displays a scatterplot visualizing the simulated and detected percentages.

Conclusion

cFinder is easy to use and quickly identifies and quantifies clones in a mixed sequencing result this discriminates the cFinder from other software products. Overall, all clones and their frequencies are designated with a minimum of user interaction. The tool can be used for data analysis for research as published in [8]. Both alignment types (AVA and CLC) showed a similar performance, supported the previous manual findings, and could even refine them in some cases. The simulation experiment confirms the high accuracy. Thus, our cFinder is completely independent from any alignment tool or settings, allowing for a maximum of flexibility for a specific application.

Availability and requirements

Project name: cFinder.
Availability (including test data): http://sourceforge.net/projects/cfinder/.
Operating system(s): Windows.
Programming language: C#.
Other requirements: .NET 4.0 or higher.
License: GNU GPL v3.0
Any restrictions for commercial usage: license needed.

Abbreviations

NGS:: next-generation sequencing
DNA:: deoxyribonucleic acid
AVA:: GS Amplicon Variant Analyzer
GS:: genome sequencer
GWB:: genomics workbench
COSMIC:: catalogue of somatic mutations in cancer
SAM:: sequence alignment/map format

References

Ladetto M, Bruggemann M, Monitillo L, Ferrero S, Pepin F, Drandi D et al. Next-generation sequencing and real-time quantitative PCR for minimal residual disease detection in B-cell disorders. Leukemia. 2013;28(6):1299–307.
Article PubMed Google Scholar
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–80.
PubMed Central CAS PubMed Google Scholar
Khorashad JS, Kelley TW, Szankasi P, Mason CC, Soverini S, Adrian LT, et al. BCR-ABL1 compound mutations in tyrosine kinase inhibitor-resistant CML: frequency and clonal relationships. Blood. 2013;121:489–98.
Article PubMed Central CAS PubMed Google Scholar
Beerenwinkel N, Zagordi O. Ultra-deep sequencing for the analysis of viral populations. Curr Opin Virol. 2011;1:413–8.
Article CAS PubMed Google Scholar
Kuleshov V, Xie D, Chen R, Pushkarev D, Ma Z, Blauwkamp T et al. Whole-genome haplotyping using long reads and statistical methods. Nat Biotechnol. 2014;32(3):261–6.
Article PubMed Central CAS PubMed Google Scholar
Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet. 2014;15:121–32.
Article CAS PubMed Google Scholar
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9.
Article PubMed Central PubMed Google Scholar
Kastner R, Zopf A, Preuner S, Proll J, Niklas N, Foskett P, et al. Rapid identification of compound mutations in patients with Philadelphia-positive leukaemias by long-range next generation sequencing. Eur J Cancer. 2014;50:793–800.
Article CAS PubMed Google Scholar
Randy C, Panos MP. An exact algorithm for the maximum clique problem. Oper Res Lett. 1990;9:375–82.
Article Google Scholar
McElroy KE, Luciani F, Thomas T. GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genom. 2012;13:74.
Article Google Scholar

Download references

Authors’ contributions

NN implemented part of the software, performed analysis and wrote the manuscript. JH wrote source code and performed tests. AB and KW performed manual analysis and critically reviewed the manuscript. SD significantly contributed to the infer relationships algorithm. JP and CG designed the concept and critically reviewed the manuscript. TL and SP contributed to the testing phase. PV contributed to the concept of the algorithm. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by the Austrian Science Fund (FWF), SFB Grants F4704-B20 and F4705-B20.

Compliance with ethical guidelines

Competing interests The authors declare that they have no competing interests.

Author information

Authors and Affiliations

Red Cross Transfusion Service for Upper Austria, Krankenhausstraße 7, 4017, Linz, Austria
Norbert Niklas, Julia Hafenscher, Agnes Barna, Karin Wiesinger, Johannes Pröll & Christian Gabriel
University of Applied Sciences Upper Austria, Softwarepark 11, 4232, Hagenberg, Austria
Stephan Dreiseitl
Children’s Cancer Research Institute, Vienna, Austria
Sandra Preuner-Stix & Thomas Lion
Division of Hematology and Hemostaseology, Department of Medicine I, Ludwig Boltzmann Cluster Oncology, Medical University of Vienna, Vienna, Austria
Peter Valent

Authors

Norbert Niklas
View author publications
You can also search for this author in PubMed Google Scholar
Julia Hafenscher
View author publications
You can also search for this author in PubMed Google Scholar
Agnes Barna
View author publications
You can also search for this author in PubMed Google Scholar
Karin Wiesinger
View author publications
You can also search for this author in PubMed Google Scholar
Johannes Pröll
View author publications
You can also search for this author in PubMed Google Scholar
Stephan Dreiseitl
View author publications
You can also search for this author in PubMed Google Scholar
Sandra Preuner-Stix
View author publications
You can also search for this author in PubMed Google Scholar
Peter Valent
View author publications
You can also search for this author in PubMed Google Scholar
Thomas Lion
View author publications
You can also search for this author in PubMed Google Scholar
Christian Gabriel
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Norbert Niklas.

Additional files

Additional file 1:

Archive containing the cFinder software. Includes executable, library and settings files.

Additional file 2:

Archive containing test data. Include annotations and exemplary alignment files from AVA and CLC for sample 5, as well as simulated test reads for overlapping amplicons.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Niklas, N., Hafenscher, J., Barna, A. et al. cFinder: definition and quantification of multiple haplotypes in a mixed sample. BMC Res Notes 8, 422 (2015). https://doi.org/10.1186/s13104-015-1382-7

Download citation

Received: 11 July 2014
Accepted: 24 August 2015
Published: 07 September 2015
DOI: https://doi.org/10.1186/s13104-015-1382-7

cFinder: definition and quantification of multiple haplotypes in a mixed sample