NCLcomparator: systematically post-screening non-co-linear transcripts (circular, trans-spliced, or fusion RNAs) identified from various detectors
Non-co-linear (NCL) transcripts consist of exonic sequences that are topologically inconsistent with the reference genome in an intragenic fashion (circular or intragenic trans-spliced RNAs) or in an intergenic fashion (fusion or intergenic trans-spliced RNAs). On the basis of RNA-seq data, numerous NCL event detectors have been developed and have detected thousands of NCL events in diverse species. However, there are great discrepancies in the identification results among detectors, indicating a considerable proportion of false positives in the detected NCL events. Although several helpful guidelines for evaluating the performance of NCL event detectors have been provided, a systematic guideline for assessing the NCL events identified by existing tools is not yet available.
We developed a software package, NCLcomparator, for systematically post-screening the intragenic or intergenic NCL events identified by various NCL detectors. NCLcomparator first examines whether the input NCL events are potentially false positives derived from ambiguous alignments (i.e., the NCL events have an alternative co-linear explanation or multiple matches against the reference genome). To evaluate the reliability of the identified NCL events, we define the NCL score (NCLscore) based on the variation in the number of supporting NCL junction reads identified by the tools examined. Of the input NCL events, we show that the ambiguous alignment-derived events have relatively lower NCLscore values than the other events, indicating that an NCL event with a higher NCLscore has a higher level of reliability. To help select highly expressed NCL events, NCLcomparator also provides a series of useful measurements such as the expression levels of the detected NCL events and their corresponding host genes and the junction usage of the co-linear splice junctions at both NCL donor and acceptor sites.
NCLcomparator provides useful guidelines, requiring as input only the NCL events identified by various detectors and the corresponding paired-end RNA-seq data, to help users select potentially high-confidence NCL events for further functional investigation. The software thus helps to facilitate future studies into NCL events, shedding light on the fundamental biology of this important but understudied class of transcripts. NCLcomparator is freely accessible at https://github.com/TreesLab/NCLcomparator.
Keywords: RNA-seq; Non-co-linear RNA; Circular RNA; Trans-spliced RNA; Gene fusion; Alignment ambiguity
FPKM: Fragments per kilobase of transcript per million mapped reads
SCE: Synonymous constraint elements
TPM: Transcripts per million
WGS: Whole genome sequencing
Transcriptome-wide analyses of high-throughput RNA sequencing (RNA-seq) data have discovered a large number of ‘non-co-linear’ (NCL) transcripts, in which the exonic sequences are topologically inconsistent with the reference genome in an intragenic fashion (circular or intragenic trans-spliced RNAs) or in an intergenic fashion (fusion or intergenic trans-spliced RNAs) [1, 2, 3, 4]. Although NCL transcripts were reported to be generally expressed at a rather low level compared with co-linear mRNAs, some NCL transcripts may be even more highly expressed than their corresponding co-linear isoforms or be evolutionarily conserved across species. Accumulating evidence shows their biological importance in gene regulation and disease diagnosis [4, 7, 8, 9]. Some fusion transcripts were demonstrated to correlate with malignant hematological disorders and sarcomas [10, 11, 12, 13]. BCR-ABL1, a prominent example of a fusion gene, was shown to be important in adult acute lymphoblastic leukemia cases and served as an effective biomarker for chronic myeloid leukemia [14, 15, 16, 17]. Some trans-spliced RNAs may play a role in anti-apoptotic function [3, 18, 19] and prostate cancer [3, 20]. A trans-spliced long non-coding RNA, tsRMST, can regulate pluripotency maintenance of human embryonic stem cells (hESCs) by repressing WNT5A [7, 21]. Circular RNAs (circRNAs) are ubiquitous and have been observed in diverse species [5, 22, 23, 24, 25, 26, 27]. The best-known function of circRNAs is their regulatory role as microRNA sponges [6, 28, 29, 30, 31, 32]. In addition, circRNAs can regulate their parent genes [4, 8, 33, 34, 35], or play a regulatory role in development [26, 36, 37], the aging nervous system, and cancer growth/metastasis [32, 39].
Numerous RNA-seq-based NCL event detectors have been developed and employed to identify thousands of NCL transcript candidates in diverse species [40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]. However, detection of NCL events is still hampered by potentially false calls arising from sequencing errors, ambiguous alignment, and in vitro artifacts, which lead to great discrepancies in the detection results among tools [4, 51, 52, 53]. In addition, the biogenesis and functions of circRNAs and trans-spliced RNAs are mostly unclear. Even if the computationally identified NCL events are generated in vivo, it remains debatable whether most of them are merely side-products of imperfect pre-mRNA splicing [24, 54]. As more NCL events are detected, the reliability and function of the identified NCL events become an unavoidable question for further investigation. Although several studies have provided helpful guidelines for evaluating the performance of various NCL event-detection tools [1, 4, 51, 55, 56], a systematic guideline for assessing the NCL events identified by different tools is not yet available. To reduce the cost of further validation and functional analysis, it is essential to systematically evaluate the reliability of the detected NCL events.
To provide users with useful guidelines on screening the NCL events identified by various detectors, we developed an analysis package, NCLcomparator, for systematic comparisons of the outputs from different detectors. First, for each input NCL event, NCLcomparator concatenates the sequence flanking the NCL junction and then examines whether this NCL event is a potential false positive derived from ambiguous alignments by aligning the concatenated sequence against the reference genome. Next, on the basis of the number of supporting NCL junction reads derived from the tools compared, NCLcomparator defines the NCL score, NCLscore, to evaluate the reliability of the input NCL events. To help select highly expressed NCL events, NCLcomparator provides expression levels of NCL events and their corresponding co-linear host genes and calculates the ratio of the number of reads spanning the NCL junction to that spanning the co-linear splice junctions at both NCL donor and acceptor sites. NCLcomparator further estimates the frequencies of occurrence of the co-linear junctions at the NCL donor and acceptor splice sites in the host genes to examine the usage of the NCL junctions. NCLcomparator also provides the number of mapped paired-end reads with one read falling outside the identified intragenic circle, which can be regarded as a good indicator for discriminating between circRNAs and intragenic trans-spliced RNAs. Taken together, NCLcomparator is helpful not only for selecting highly confident and highly expressed NCL events but also for further investigating the biogenesis and function of this important but understudied class of transcripts. Of note, NCLcomparator analyzes both intragenic and intergenic NCL events, allowing researchers to compare circRNA detectors as well as gene-fusion detectors.
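As a rough illustration of the read-count ratio described above, the following Python sketch divides the number of reads spanning an NCL junction by the number of reads spanning the co-linear splice junctions at the donor and at the acceptor site. It is one literal reading of that sentence, not the tool's own code: the function names, the dictionary keys, and the choice to return `None` when no co-linear reads are observed are all assumptions made for this example.

```python
from typing import Dict, Optional


def junction_ratio(ncl_reads: int, colinear_reads: int) -> Optional[float]:
    """Ratio of NCL junction reads to co-linear junction reads at one splice site.

    Returns None when no co-linear reads are observed, because the ratio is
    undefined in that case (an assumption of this sketch).
    """
    if colinear_reads == 0:
        return None
    return ncl_reads / colinear_reads


def ncl_ratios(
    ncl_reads: int,
    donor_colinear_reads: int,
    acceptor_colinear_reads: int,
) -> Dict[str, Optional[float]]:
    """Compute the ratio separately at the NCL donor and acceptor sites."""
    return {
        "donor": junction_ratio(ncl_reads, donor_colinear_reads),
        "acceptor": junction_ratio(ncl_reads, acceptor_colinear_reads),
    }


if __name__ == "__main__":
    # Hypothetical read counts, for illustration only.
    print(ncl_ratios(ncl_reads=10, donor_colinear_reads=40, acceptor_colinear_reads=20))
```

Reporting the two sites separately keeps the donor and acceptor usage visible, which matches the description of measurements "at both NCL donor and acceptor sites".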
Since repetitive sequences or paralogous genes often masquerade as NCL events due to ambiguous alignments of short RNA-seq reads [1, 25, 58, 59, 60], NCLcomparator checks the alignment ambiguity of the input NCL events and removes such potentially false positives. To this end, for each input NCL event, NCLcomparator concatenates the exonic sequence flanking the NCL junction (within −100 to +100 nucleotides of each NCL junction) and then aligns the 200 bp concatenated sequence against the reference genome and well-annotated transcripts using BLAT. Of note, the concatenated sequence may be shorter than 200 bp if the flanking exonic circRNA sequence is shorter than 200 bp. A concatenated sequence is regarded as a false positive derived from ambiguous alignments if it has at least one alternative co-linear explanation (the sequence similarity of the alternative co-linear explanation is more than 80% identical to that of the non-co-linear one; Fig. 1b, top) or maps to multiple positions with similar BLAT mapping scores (difference of BLAT-mapping scores < 3; Fig. 1b, bottom).
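The screening logic above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not NCLcomparator's implementation: it does not run BLAT, it takes alignment similarities and scores as plain numbers, and it reads the 80% criterion as "the co-linear similarity exceeds 80% of the non-co-linear similarity". The function names, argument names, and return labels are hypothetical; only the 100-nucleotide flank, the 80% threshold, and the score difference of 3 come from the text.

```python
from typing import List

FLANK = 100                 # nucleotides taken on each side of the NCL junction
SIMILARITY_FRACTION = 0.80  # threshold for an alternative co-linear explanation
SCORE_MARGIN = 3            # score difference below which two hits count as similar


def junction_sequence(donor_exon_seq: str, acceptor_exon_seq: str,
                      flank: int = FLANK) -> str:
    """Concatenate the exonic sequence on both sides of the NCL junction.

    Takes the last `flank` bases before the donor site and the first `flank`
    bases after the acceptor site. The result is shorter than 2 * flank when
    either exonic sequence is shorter than `flank`.
    """
    return donor_exon_seq[-flank:] + acceptor_exon_seq[:flank]


def classify_event(ncl_similarity: float,
                   colinear_similarities: List[float],
                   placement_scores: List[float]) -> str:
    """Label an event from alignment results that are assumed to be precomputed.

    ncl_similarity: similarity of the non-co-linear alignment.
    colinear_similarities: similarities of any co-linear alignments found.
    placement_scores: mapping scores of the candidate genomic placements.
    """
    # Rule 1: an alternative co-linear explanation that is nearly as good.
    threshold = SIMILARITY_FRACTION * ncl_similarity
    if any(s > threshold for s in colinear_similarities):
        return "alternative_colinear"
    # Rule 2: two or more placements with nearly identical mapping scores.
    ranked = sorted(placement_scores, reverse=True)
    if len(ranked) >= 2 and ranked[0] - ranked[1] < SCORE_MARGIN:
        return "multiple_hits"
    return "non_ambiguous"


if __name__ == "__main__":
    seq = junction_sequence("A" * 150, "C" * 150)
    print(len(seq))
    print(classify_event(100.0, [85.0], [190.0]))
```

Events labelled `alternative_colinear` or `multiple_hits` would be the ones set aside as ambiguous; how the similarities and scores are extracted from the aligner output is left out of this sketch.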
[Table: The number of intragenic and intergenic NCL events before and after screening. Column: number of NCL events. Rows, listed separately for intragenic and intergenic NCL events: alignment ambiguity (ambiguous NCL events); alternative co-linear explanation; after screening (non-ambiguous NCL events).]
There are several major challenges for detection of NCL events. In addition to false positives arising from alignment ambiguity and biased identification of NCL events by different bioinformatics approaches as stated above, identification of NCL events is often hampered by in vitro artifacts, particularly template switching during reverse transcription (RT) [2, 7, 42, 45, 59, 70, 71]. To minimize potential RT artifacts, it would be better to confirm identified NCL events using both RT- and non-RT-based experiments (e.g., Northern blot or RNase protection assay). However, a method for systematic identification of NCL events that controls for experimental artifacts is still needed. While a study successfully detected a huge number of experimental artifacts based on Drosophila hybrid mRNAs (D. melanogaster females vs. D. sechellia males) and a mixed mRNA-negative control sample, this approach cannot be applied to human studies. Alternatively, it has been demonstrated that RTase-dependent RNA products are likely to be RT artifacts [2, 4, 7, 73, 74]. RT-based artifacts can be detected by comparing the products of different RTases, which was shown to be as effective as RTase-free validation [2, 7]. On the basis of comparisons of Avian Myeloblastosis Virus- and Moloney Murine Leukemia Virus-derived RTase products, a recent study successfully applied this concept to human samples and systematically identified NCL events while controlling for experimental artifacts.
Moreover, NCL junctions can be generated during post-transcriptional processes (trans-spliced or circular RNAs) or by genetic rearrangements (fusion RNAs) at the DNA level. Thus, discrimination between post-transcriptional NCL events and genetic rearrangements presents another challenge to detection/analysis of NCL events. Since NCL events that are observed in multiple biological samples or conserved across multiple species are less likely to be formed by somatic recombination, post-transcriptional NCL events may be extracted by this simple rule [2, 7]. A more efficient approach is to analyze both RNA-seq data and whole genome sequencing (WGS) data from the same sample. Some systematic pipelines that integrate WGS-based rearrangement detection with RNA-seq-based NCL detection have been developed to identify fusion RNAs and have been successfully applied to the analysis of functionally recurrent gene fusions in human diseases [75, 76, 77, 78, 79, 80]. While many studies have focused on identification/analysis of fusion RNAs that consist of sequence fragments from different genes, transcribed rearrangements in an intragenic fashion are relatively less investigated.
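The "simple rule" of keeping NCL events observed in multiple biological samples can be sketched as a recurrence filter. The Python example below is illustrative only: the junction representation, the function name, and the default cut-off of two samples are assumptions of this sketch and are not taken from the cited studies.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Hypothetical junction key: (donor chromosome, donor position,
#                             acceptor chromosome, acceptor position)
Junction = Tuple[str, int, str, int]


def recurrent_events(events_by_sample: Dict[str, Iterable[Junction]],
                     min_samples: int = 2) -> Set[Junction]:
    """Return the NCL junctions observed in at least `min_samples` samples."""
    counts: Counter = Counter()
    for events in events_by_sample.values():
        # Count each junction at most once per sample.
        counts.update(set(events))
    return {junction for junction, n in counts.items() if n >= min_samples}


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    a = ("chr1", 1000, "chr1", 500)
    b = ("chr2", 2000, "chr5", 300)
    samples = {"sample1": [a, b], "sample2": [a], "sample3": [a]}
    print(recurrent_events(samples))
```

Recurrence alone cannot rule out a rearrangement that is shared between samples, which is why the text points to matched WGS data as the more efficient approach.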
As more and more NCL events are identified, the reliability and function of such a large number of NCL events remain an open question worthy of further investigation. To reduce the cost of subsequent validation and functional analysis, a method that carefully evaluates the reliability of detected NCL events while considering all of the abovementioned challenges awaits further development.
Dozens of RNA-seq-based detectors have been developed and have successfully identified thousands of NCL transcript candidates (circular, trans-spliced, or fusion RNAs) in diverse species. However, there are great discrepancies in the identification results (including the number of NCL events and the number of supporting NCL junction reads of the identified events) among tools, indicating a considerable proportion of potentially false positives in the results. NCLcomparator screens out potentially false positives originating from ambiguous alignments and provides a series of useful measurements, including NCL score (NCLscore), NCL ratio (RNCL), circular fraction (CF), the usage of the co-linear junctions at both NCL donor and acceptor splice sites in the corresponding host gene (PD, PA, and Pmedian), and the expression levels of NCL events (RPMraw and RPMmapped) and their corresponding co-linear host genes (FPKM and TPM), for users to screen the NCL events from various detectors. On the basis of the NCLcomparator-provided information, users can easily select highly plausible NCL candidates with a high expression level and/or a low variation of supporting NCL junction reads from multiple NCL detectors. The software, a post-processing tool for screening identified NCL events from existing detectors, thus helps to facilitate future studies into NCL events, shedding light on the fundamental biology of this important but understudied class of transcripts.
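To make the expression measurements more concrete, the sketch below computes a reads-per-million (RPM) value for an NCL junction. It assumes that RPMraw and RPMmapped differ only in the denominator (all raw reads versus mapped reads); that reading, the function names, and the dictionary keys are assumptions of this example rather than definitions taken from the software.

```python
from typing import Dict


def rpm(junction_reads: int, total_reads: int) -> float:
    """Reads per million: junction-supporting reads scaled to one million reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return junction_reads * 1_000_000 / total_reads


def expression_summary(junction_reads: int, raw_reads: int,
                       mapped_reads: int) -> Dict[str, float]:
    """RPM normalized by raw and by mapped read counts (assumed definitions)."""
    return {
        "RPM_raw": rpm(junction_reads, raw_reads),
        "RPM_mapped": rpm(junction_reads, mapped_reads),
    }


if __name__ == "__main__":
    # Hypothetical library sizes, for illustration only.
    print(expression_summary(junction_reads=50,
                             raw_reads=20_000_000,
                             mapped_reads=10_000_000))
```

A user could then rank candidates by either value, or combine them with the other reported measurements, when selecting highly expressed NCL events.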
Availability and requirements
Project name: NCLcomparator.
Project home page: https://github.com/TreesLab/NCLcomparator
Operating system(s): Linux-like environment (Bio-Linux).
Programming language: shell script.
Other requirements: None.
Any restrictions to use by non-academics: None.
Data: The tested RNA-seq data was derived from HeLa cells with rRNA depletion, which was downloaded from the NCBI Sequence Read Archive (SRR1637089) at https://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR1637089. All parameter settings and identification results of intragenic/intergenic NCL detectors tested in this study are reported in Additional file 1: Table S1 and Additional file 2: Table S2, respectively.
This work was supported by grants from the Genomics Research Center, Academia Sinica, Taiwan and the Ministry of Science and Technology (MOST), Taiwan (under contracts MOST 103–2628-B-001-001-MY4 and MOST 107–2311-B-001-046). The funding bodies did not play any role in the study design; the collection, analysis, and interpretation of the data; or the writing of the manuscript.
Availability of data and materials
The NCLcomparator software package, source code, and test data sets are available at https://github.com/TreesLab/NCLcomparator.
TJC designed the research and wrote the manuscript. CYC implemented the software and conducted the case studies. Both authors read and approved the final manuscript.
The authors declare that they have no competing interests.
- 9. Chwalenia K, Facemire L, Li H. Chimeric RNAs in cancer and normal physiology. Wiley Interdiscip Rev RNA. 2017;8(6):e1427.
- 14. O'Brien SG, Guilhot F, Larson RA, Gathmann I, Baccarani M, Cervantes F, Cornelissen JJ, Fischer T, Hochhaus A, Hughes T, et al. Imatinib compared with interferon and low-dose cytarabine for newly diagnosed chronic-phase chronic myeloid leukemia. N Engl J Med. 2003;348(11):994–1004.
- 17. Westbrook CA, Hooberman AL, Spino C, Dodge RK, Larson RA, Davey F, Wurster-Hill DH, Sobol RE, Schiffer C, Bloomfield CD. Clinical significance of the BCR-ABL fusion gene in adult acute lymphoblastic leukemia: a Cancer and leukemia group B study (8762). Blood. 1992;80(12):2983–90.
- 20. Rickman DS, Pflueger D, Moss B, VanDoren VE, Chen CX, de la Taille A, Kuefer R, Tewari AK, Setlur SR, Demichelis F, et al. SLC45A3-ELK4 is a novel and frequent erythroblast transformation-specific fusion transcript in prostate cancer. Cancer Res. 2009;69(7):2734–8.
- 37. Szabo L, Morey R, Palpant NJ, Wang PL, Afari N, Jiang C, Parast MM, Murry CE, Laurent LC, Salzman J. Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development. Genome Biol. 2015;16:126.
- 41. Ha KC, Lalonde E, Li L, Cavallone L, Natrajan R, Lambros MB, Mitsopoulos C, Hakas J, Kozarewa I, Fenwick K, et al. Identification of gene fusion transcripts by transcriptome sequencing in BRCA1-mutated breast cancers and cell lines. BMC Med Genet. 2011;4:75.
- 44. Zhao Q, Caballero OL, Levy S, Stevenson BJ, Iseli C, de Souza SJ, Galante PA, Busam D, Leversha MA, Chadalavada K, et al. Transcriptome-guided characterization of genomic rearrangements in a breast cancer cell line. Proc Natl Acad Sci U S A. 2009;106(6):1886–91.
- 52. Abate F, Acquaviva A, Paciello G, Foti C, Ficarra E, Ferrarini A, Delledonne M, Iacobucci I, Soverini S, Martinelli G, et al. Bellerophontes: an RNA-Seq data analysis framework for chimeric transcripts discovery based on accurate fusion model. Bioinformatics. 2012;28(16):2114–21.
- 60. Chen C-Y, Chuang T-J. Comment on A comprehensive overview and evaluation of circular RNA detection tools. PLoS Comp Biol. 2018; in press.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.