Hecaton: reliably detecting copy number variation in plant genomes using short read sequencing data
Copy number variation (CNV) is thought to actively contribute to adaptive evolution of plant species. While many computational algorithms are available to detect copy number variation from whole genome sequencing datasets, the typical complexity of plant data likely introduces false positive calls.
To enable reliable and comprehensive detection of CNV in plant genomes, we developed Hecaton, a novel computational workflow tailored to plants, that integrates calls from multiple state-of-the-art algorithms through a machine-learning approach. In this paper, we demonstrate that Hecaton outperforms current methods when applied to short read sequencing data of Arabidopsis thaliana, rice, maize, and tomato. Moreover, it correctly detects dispersed duplications, a type of CNV commonly found in plant species, in contrast to several state-of-the-art tools that erroneously represent this type of CNV as overlapping deletions and tandem duplications. Finally, Hecaton scales well in terms of memory usage and running time when applied to short read datasets of domesticated and wild tomato accessions.
Hecaton provides a robust method to detect CNV in plants. We expect it to be of immediate interest to both applied and fundamental research on the relationship between genotype and phenotype in plants.
KeywordsCopy number variation Structural variation Plant adaptation Machine learning
Copy number variation or copy number variant
Single nucleotide polymorphism
Whole genome sequencing
Phenotypic variation between individuals of the same plant species is caused by a host of different types of genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions, and larger structural variation. One major class of structural variation is copy number variation (CNV), which is defined as deletions, insertions, tandem duplications and dispersed duplications of at least 50 bp. CNV comprises a large part of the genetic variation found within plant populations and is thought to play a key role in adaptation and evolution . One clear example of such adaptive evolution is presented by the weed species Amaranthus palmeri, which rapidly became resistant to a widely used herbicide through amplification of the EPSPS gene, resulting in increased expression . Similar relationships between CNV and adaptation were found in domesticated crop species , indicating that CNV may offer a pool of genetic variation that can be used to improve crop cultivars.
Given the increasing interest of the plant research community in CNV [1, 3, 4], the question arises whether current methods accurately detect copy number variants (CNVs) in plants. Currently, CNVs are mainly analyzed by whole genome sequencing (WGS). After a sample of interest has been sequenced and the resulting sequencing data has been aligned to a reference genome, computational methods can extract various signals from the alignments to detect CNV between the sample and the reference . While long reads are better suited for detecting CNVs than short paired-end reads [6, 7], sequencing data of plants is still commonly generated using short read sequencing platforms, due to their far lower cost.
Although current state-of-the-art CNV detection algorithms generally perform well when applied to human datasets , the typical complexity of plant data likely introduces false positive calls. First, reference genome assemblies of plants generally contain a larger number of gaps than the human reference genome, as plant genomes are difficult to assemble due to their repetitive nature. Yet, the genomic sequence contained in such gaps is still present in WGS data of samples. The reads representing this sequence generally share high similarity with other assembled regions of the reference, to which they are incorrectly aligned as a result. Second, sampled plant genomes can differ significantly from reference genome assemblies, particularly if samples represent out-bred or natural accessions. If a region in a sample genome has undergone several mutations relative to the reference, reads sequenced from this region may map to a different region than the one it is syntenic to. This is particularly likely to happen if the region that the reads originated from is highly repetitive. Third, several CNV detection algorithms erroneously process alignments resulting from dispersed duplications . We expect that this issue introduces a significant number of false positives when working with plant data, as duplication and transposition of genomic sequences is considered to be one of the main drivers of adaptive evolution in plants .
To enable reliable and comprehensive detection of copy number variants in plant genomes, we developed Hecaton, a novel computational workflow that combines several existing detection methods, specifically tailored to detect CNV in plants. Combining methods generally results in higher recall and precision than using a single tool [11, 12], as the recall and precision of individual tools varies among different types and sizes of CNVs, depending on their algorithmic design . However, determining the optimal strategy to integrate different methods is not straightforward. A suboptimal integration approach may yield only a small gain of precision, while significantly decreasing recall [8, 13]. Hecaton tackles this challenge in two ways. First, it makes use of a custom post-processing step to correct erroneously detected dispersed duplications, which are systematically mispredicted by some state-of-the-art tools. Second, it utilizes a machine-learning model which classifies detected calls as true and false positives by leveraging several features describing a detected CNV call, such as its type and size, along with concordance among the callers used to detect it. In this paper, we demonstrate that Hecaton outperforms existing individual and ensemble computational CNV detection methods when applied to plant data and provide an example of its utility to the plant research community.
Selected CNV calling tools
To maximize the performance of Hecaton, we combine predictions of a diverse set of popular, open-source tools that complement each other in terms of the signals and strategies used to call CNVs. The selected tools include Delly  (version 0.7.8), GRIDSS  (version 1.8.1), LUMPY  (version 0.2.13), and Manta  (version 1.4.0). Delly detects CNVs using discordantly aligned read pairs and refines the breakends of detected events using split reads. LUMPY improves upon this method by integrating both of these signals to detect CNVs, as opposed to using them sequentially. Manta and GRIDSS further enhance this strategy by performing local assembly of sequences flanking breakends identified by discordantly aligned read pairs and split reads. We considered including CNVnator  (version 0.3.2), Control-FREEC  (version 10.4), and Pindel  (version 0.2.5b9). Pindel was dropped after showing an excessively long run time when applied to simulated high coverage datasets. CNVnator and Control-FREEC were excluded as they performed poorly during evaluations (Additional File 1: Figure S1).
Implementation of hecaton
Stage 1: Calling
The calling stage takes paired-end Illumina WGS data of a sample of interest and a reference genome as input and calls CNVs between the sample and reference using four different tools. First, it aligns the sequencing data to the reference using the Speedseq pipeline  (version 0.1.2) with default parameters. This pipeline utilizes bwa mem  (version 0.7.10-r789) to align reads, SAMBLASTER  (version 0.1.22) to mark duplicates and Sambamba  (version 0.5.9) to sort and index BAM files. The resulting sorted BAM file is processed by Delly, LUMPY, Manta and GRIDSS to call CNVs. Each of these tools is run with default parameters, except for the number of supporting reads required by LUMPY and Manta for a CNV to be included in the output (lowered to 1 to maximize recall). Delly and GRIDSS do not apply any filters by default. The final output of the calling stage consists of four VCF files containing CNVs, one for each tool.
Stage 2: Post-processing
The post-processing stage of Hecaton serves three purposes. First, it provides an automated method to process the output files of different tools using a common representation, which is necessary to properly integrate them. Second, it corrects dispersed duplications that have been detected by CNV tools as overlapping deletions and tandem duplications by mistake. Third, it merges calls produced by different tools that likely correspond to the same CNV event.
The common representation of CNVs used by Hecaton is based on the concept that each structural variant can be represented as a set of novel adjacencies. A novel adjacency is defined as a pair of bases that are adjacent to each other in the genome of a sample of interest, but not in the genome of the reference to which the sample is compared. Bases that are linked by a novel adjacency are called breakends and two breakends that corresponds to the same adjacency are referred to as mates. Although Delly, GRIDSS, LUMPY, and Manta all generate a VCF file as output, the way in which CNV calls and the evidence supporting them are represented in this file is different for each tool. For example, the output of Delly, LUMPY, and Manta contains both CNVs and breakends, while that of GRIDSS solely consists of breakends that need to be converted to CNVs by the user.
To convert the output of each tool to a common CNV format and correct erroneous dispersed duplications, Hecaton reclassifies the adjacencies underlying the CNV calls produced by each tool. First, it infers and collects adjacencies from all sets of CNVs generated during the calling stage. For example, it represents deletions as a single adjacency containing two breakends positioned on the 5’ and 3’ end of the deleted sequence. Next, it clusters adjacencies generated by the same tool of which the breakpoints are located within 10 bp of each other on either the 5’ end or 3’ end, as these are likely to be part of the same variant. Finally, it converts each cluster to a deletion, insertion, tandem duplication, or dispersed duplication, based on the relative positions of the breakends and the orientation of the sequences that are joined in a cluster. Deletions, insertions, and tandem duplications are represented by single adjacencies, while dispersed duplications are represented by two (Additional File 1: Figure S2). As the objective of Hecaton is to detect CNV and not any other form of structural variation, it excludes any set of adjacencies that cannot be classified as one of these four types from further analysis. However, Hecaton can be extended to support additional types of structural variation if needed.
Hecaton collapses calls produced by different tools that are likely to correspond to the same CNV event. Calls are merged if they fulfill all of the following conditions: they are of the same type; their breakpoints are located within 1000 bp of each other on both the 5’ and 3’ end; they share at least 50% reciprocal overlap with each other (does not apply to insertions); and the distance between the insertion sites is no more than 10 bp (only applies to dispersed duplications and insertions). The regions of the merged calls are defined as the union of the regions of the “donor" calls. For instance, one call that covers positions 12-30 and one call that covers positions 14-32 are merged into a call covering positions 12-32. The number of discordantly aligned read pairs and split reads supporting a merged call are both defined as the median of the numbers of the donor calls. The final result of the post-processing stage is a single BEDPE file containing all merged calls.
Stage 3: Filtering
In the filtering stage, Hecaton applies a machine-learning model to remove erroneous CNV calls. First, it generates a feature matrix that represents the set of merged calls. The rows of the matrix correspond to CNV calls and the columns correspond to features (Additional file 2: Table S1), which are extracted from the INFO and FORMAT fields of the VCF file containing the calls.
Hecaton classifies calls as true or false positives using a random forest model. We chose to implement this particular type of machine-learning model, as it outperformed a logistic regression model and a support vector machine. The model assigns a probabilistic score to each merged call based on the set of features defined for it. These scores are posterior probability estimates of calls being true positives and range between 0 and 1. Calls with scores below a specified user-defined cutoff are dropped, producing a BEDPE file containing the final output of Hecaton.
To obtain a random forest model that strikes a good balance between recall and precision, we trained it using a set of CNVs detected from real WGS data for which the labels (true or false positive) were known, based on long read data (see Additional file 3: Supplementary Methods for details on the validation procedure). We did not include CNVs obtained from simulated data in the ground truth set, as the recall and precision attained by Delly, LUMPY, Manta, and GRIDSS on such data generally does not accurately reflect their performance in real scenarios. For example, LUMPY and Manta obtained almost perfect precision when we applied them to simulated datasets with minimum filtering, if dispersed duplications were excluded from the simulation. They showed significantly lower precision in previous benchmarks when applied to real human data [16, 17].
The training and testing set were constructed by running the calling and post-processing stages of Hecaton on Illumina data of an Arabidopsis thaliana Col-0–Cvi-0 F1 hybrid and a sample of the Japonica rice Suijing18 cultivar (Additional file 2: Table S2). We detected CNVs in these samples relative to the A. thaliana Col-0 (version TAIR10) and Oryza sativa Japonica (version IRGSP-1.0) reference genome. As we aimed to maximize the performance of the model for low coverage datasets in particular, we subsampled these datasets to 10x coverage using seqtk . Calls were labeled as true or false positives using long read data of the same samples (See Additional file 3: Supplementary Methods for details). To obtain a test set, we held out calls located on chromosomes 2 and 4 of A. thaliana and chromosomes 6, 10, and 12 of O. sativa, using the remaining calls as the training set. In order to obtain a model that generalizes to multiple plant species, one single model was trained using both Col-0–Cvi-0 and Suijing18 calls. The training set contained 4983 deletions, 393 insertions, 604 tandem duplications and 106 dispersed duplications, while the test set contained 2291 deletions, 174 insertions, 292 tandem duplications and 44 dispersed duplications.
We implemented the random forest model in Python using the scikit-learn package  (version 0.19.1). The hyperparameters of the model (n_estimators, max_depth, and max_features) were selected by doing a grid search with 10-fold cross-validation on the training set, using the accuracy of the model on the validation data as optimization criterion.
The performance of Hecaton was compared to that of current state-of-the-art tools using short read data simulated from rearranged versions of the Solanum lycopersicum Heinz 1706 reference genome of tomato ; the testing set constructed from A. thaliana Col-0–Cvi-0 and rice Suijing18; and real short read data of A. thaliana Ler, maize B73, and several tomato samples (Additional file 2: Table S2). We determined the recall and precision of tools with two validation methods that use long read data: VaPoR  and Sniffles . See Additional file 3: Supplementary Methods for full details.
Results and discussion
We present Hecaton, a novel computational workflow to reliable detect CNVs in plant genomes (Fig. 1). It consists of three stages. In the first stage, it aligns short read WGS data to a reference genome of choice and calls CNVs from the resulting alignments using Delly, GRIDSS, LUMPY, and Manta, four state-of-the-art tools that complement each other in terms of their methodological set-up. In the second stage, Hecaton corrects dispersed duplications that are erroneously represented by these tools as overlapping deletions and tandem duplications. In the final stage, Hecaton filters calls by using a random forest model trained on CNV calls validated by long read data. Below, we first describe how the design of Hecaton allows it to outperform the current state-of-the-art and then we will present an application of Hecaton to crop data.
Hecaton accurately detects dispersed duplications
Dispersed duplications are defined as duplications in which the duplicated copy is found at a genomic region that is not adjacent to the original template sequence. Such variants are frequently found in plants, as plant genomes typically contain a large number of class I transposable elements that propagate themselves through a “copy and paste" mechanism. While dispersed duplications may play an important role in the adaptive evolution of plants , they can also introduce a significant number of false positives, if they are not taken into account while calling CNVs. To show the impact of this problem, we applied Delly, GRIDSS, LUMPY, and Manta to short read data simulated from modified versions of the S. lycopersicum Heinz 1706 reference genome containing different types of CNVs at known locations.
The performance of the post-processing step improved with coverage (Fig. 3), as it fails to detect dispersed duplications if one or both of the adjacencies resulting from them are missing from the output of Delly, LUMPY, Manta, or GRIDSS. In line with this observation, the post-processing script detected a lower number of dispersed duplications simulated at low allele dosage compared to those simulated at high dosage (Additional file 1: Figure S3), as the effective coverage of variant alleles decreases when they are present in few haplotypes. If only one of the two adjacencies could be detected, the post-processing script classified it as a false positive deletion, false positive tandem duplication, or generic breakend.
Hecaton generally outperforms state-of-the-art cNV detection tools
Intuitively, it makes sense to combine the output of multiple CNV detection tools, as they typically generate complementary results when applied to the same dataset . However, designing a method that optimally integrates tools is not trivial. In a past benchmark, an ensemble strategy that combined tools through a majority vote did not significantly improve upon the best performing individual tool . Here, we demonstrate the benefits of using a machine-learning approach, which aggregates and filters calls based on features including size, type and level of support from different tools. We trained machine-learning models using CNVs detected from 10x coverage short read data of a highly heterozygous A. thaliana Col-0–Cvi-0 sample and a Suijing18 rice sample. The labels (true or false positive) of these CNVs were determined using long read data of the same samples. This approach generated accurate validations of calls detected from the simulated S. lycopersicum Heinz 1706 datasets.
Besides outperforming individual tools, the machine-learning approach employed by Hecaton significantly improved upon current state-of-the-art ensemble methods that are applicable to, but not specifically designed for plant data. It attained a better combination of precision and recall than MetaSV , SURVIVOR , and Parliament2 , three alternative approaches that aggregate the results of different CNV detection tools, when applied to datasets of Col-0–Cvi-0 and Suijing18 (Fig. 4). The poor performance of MetaSV and SURVIVOR sharply contrasts with the good performance they showed in the benchmarks of the publications describing them [31, 32]. One possible reason for this discrepancy could be that both tools were evaluated in these benchmarks using simulated data, which likely does not accurately reflect the distribution of CNVs in real data.
Consistent with previous benchmarks performed with long read data [6, 7], insertions remained difficult to reliably detect using short paired-end Illumina reads in all of our test cases, even after applying the filtering stage of Hecaton. We manually investigated alignments covering tens of false positive insertions in A. thaliana Ler and discovered that they all resulted from alignments that were soft-clipped at the insertion site. These insertions were all reported by Hecaton to have an unknown size. With some of the insertions, the mates of the soft-clipped reads mapped to a different chromosome, indicating that some may be interchromosomal transpositions instead.
The idea of predictor combination has been already applied to improve detection of structural variants from exome sequencing data , detection of somatic single nucleotide variants , inference of gene regulatory networks , and predictive models of breast cancer prognosis . Our results demonstrate that this concept can be used to improve CNV detection as well, contrasting a previous benchmark that combined structural variation methods through a majority vote . This suggests that the aggregation approach used by Hecaton is better suited to deal with the different treatment of each tool of specific types of CNV than a majority vote. A possible reason for this is that the random forest model employed by Hecaton can capture interactions between tools and CNV types to some extent during aggregation, while a majority vote assigns an equal weight to each tool. Such an approach does not work well if most tools are ill-suited to detect a specific type of CNV.
We demonstrated that Hecaton can generalize to plant species beyond those used in its training set (Fig. 5). The performance of Hecaton can be further improved, as it is relatively easy to extend it to include other CNV detection tools or to train new random forest models using additional plant data. Nevertheless, Hecaton has some limitations. First, it has limited recall for dispersed duplications when applied to very low coverage (5x) data. Second, although Hecaton has no upper limit in terms of the size of CNV it can detect, we were not able to evaluate its performance on CNVs that were larger than 1 Mb, as such calls tended to be falsely validated by one of our validation methods, VaPoR (Additional file 1: Figure S6). Third, it is not able to detect insertions with both high recall and precision, a limitation it shares with other CNV detection tools designed to work with short WGS data . Finally, we were not able to robustly assess the performance of Hecaton in polyploid plant species, as we could not find polyploid samples of which both short and long read data were publicly available. We expect that the performance of Hecaton on polyploids should be comparable to the performance reported in this work on diploids, if the polyploid sample does not show strong differences between haplotypes. To deal with additional biases found in more complex polyploid species, it may be worthwhile to obtain ground truth annotations in order to train random forest models specifically tailored to polyploids. Such annotations can be obtained by generating a small set of polyploid samples using both Illumina and PacBio sequencing platforms. Simulated data could serve as ground truth data as well, but we were unable to generate simulated CNVs that accurately represent the distribution of CNVs in real scenarios, a problem encountered in previous work that benchmarked CNV detection tools .
Hecaton provides a scalable method to detect CNVs in plant species
Hecaton running time and memory use
CPU time (h)
Real time (h)
Peak resident set size (Gb)
Although we do not know the true CNVs present in these samples, we estimated that Hecaton attained lower precision in the tomato samples than in the A. thaliana and rice sets. We considered events to be likely false positives if they did not have a clear and uniformly lower (deletions) or higher (duplications) read depth compared to the rest of the chromosome its located on or to its flanking regions, or if they showed excessive (over 1000x) read coverage. Based on these criteria, we estimated that 30% of deletions, 20% of tandem duplications, and 80% of dispersed duplications in the wild accession LYC4 sample were false positives, based on inspection of a random sample of 20 CNVs of each type.
Number of CNVs detected in tomato samples before and after filtering
Read depth filter
Read depth and gap filter
Types of CNV detected in tomato samples after filtering
Overlap of filtered CNVs with repeats, genes, and coding sequences (CDS) in tomato samples
(% of total)
(% of total)
(% of total)
Hecaton is a computational workflow specifically designed to detect CNV from WGS data of plant genomes. It improves upon the performance of current approaches primarily developed for human genomes, indicating that such tools are less suitable to plant data without optimization. In contrast to several state-of-the-art tools, Hecaton correctly detects dispersed duplications. Moreover, the random forest model employed by Hecaton improves upon current approaches in terms of recall and precision. Finally, the running time and memory usage of Hecaton scales well to crop genomes, demonstrating its practical utility to plant research.
We anticipate that Hecaton is of immediate interest to both applied and fundamental research regarding the relationship between genotype and phenotype in plants. CNVs have been linked with several stress-resistant phenotypes in crop species , including frost tolerance in wheat , aluminum tolerance in maize , and boron tolerance in barley . Hecaton can extensively query crop and wild germplasm for resistant loci, that can be characterized and eventually introgressed into elite cultivars. Besides its use in agricultural research, Hecaton may help to answer more fundamental questions regarding the role of CNV, as many characteristics regarding the role of CNV in plant adaptation are still relatively unknown . Such characteristics include how stress affects the rate at which CNVs accumulate, the main molecular mechanisms that govern the creation of CNVs, and the evolutionary dynamics that determine whether CNVs become fixed within a population. Populations of wild and domesticated plant species (such as the 100 Tomato Genome project ) may provide excellent datasets to explore these topics, given that domestication is a fairly recent phenomenon involving artificial selection for a set of well-defined and well-characterized traits.
Availability and requirements
Project name: Hecaton
Project home page: https://git.wur.nl/bioinformatics/hecaton
Operating system(s): Unix
Programming language: Python
Other requirements: Hecaton can be either installed locally or through a Docker image. All of its dependencies are listed on the project home page.
License: GNU AGPLv3
Any restrictions to use by non-academics: No
We thank Carlos de Lannoy and Mehmet Akdel for testing the installation of Hecaton.
RYW, SS, and DDR contributed to the design of Hecaton. RYW implemented Hecaton, evaluated its performance, and applied it to representative crop samples. RYW, SS, and DDR were major contributors in writing the manuscript. All authors read and approved the final manuscript.
This work was supported by the Netherlands Organisation of Scientific Research (NWO) under project grant ALWGR.2015.9. We thank the NWO and private partners Rijk Zwaan Breeding, Bejo Zaden, Genetwister Technologies, Averis Seeds, C. Meijer and HZPC Holland for their financial support through this grant. The funding bodies had no role in the design of the study; the collection, analysis, and interpretation of data; and in writing the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 23.Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM; 2013. Preprint at http://arxiv.org/abs/1207.3907. Accessed 23 July 2019.Google Scholar
- 26.Li H. seqtk, Toolkit for processing sequences in FASTA/Q formats; 2012. Available from:https://github.com/lh3/seqtk. Accessed 10th of August 2018.
- 27.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011; 12(Oct):2825–30.Google Scholar
- 33.Zarate S, Carroll A, Krashenina O, Sedlazeck FJ, Jun G, Salerno W, et al. Parliament2: fast structural variant calling using optimized combinations of callers. 2018. Preprint at https://www.biorxiv.org/content/10.1101/424267v1.abstract. Accessed 23 July 2019.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.