De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm

  • Kristoffer SahlinEmail author
  • Paul Medvedev
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11467)


Long-read sequencing of transcripts with PacBio Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (in order to scale) and makes use of quality values (in order to handle variable error rates). We test isONclust on three simulated and five biological datasets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches, both in terms of overall accuracy and/or scalability to large datasets. Our tool is available at



This work has been supported in part by NSF awards DBI-1356529, CCF-551439057, IIS-1453527, and IIS-1421908 to PM.


  1. 1.
    Byrne, A., et al.: Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nature Commun. 8, 16027 (2017)CrossRefGoogle Scholar
  2. 2.
    Sahlin, K., Tomaszkiewicz, M., Makova, K.D., Medvedev, P.: Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon. Nature Commun. 9(1), 4601 (2018)CrossRefGoogle Scholar
  3. 3.
    Tseng, E., Tang, H.T., AlOlaby, R.R., Hickey, L., Tassone, F.: Altered expression of the FMR1 splicing variants landscape in premutation carriers. Biochimica et Biophys. Acta (BBA)-Gene Regul. Mech. 1860(11), 1117–1126 (2017)CrossRefGoogle Scholar
  4. 4.
    Nattestad, M., et al.: Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res. 28(8), 1126–1135 (2018)CrossRefGoogle Scholar
  5. 5.
    Kuo, R.I., Tseng, E., Eory, L., Paton, I.R., Archibald, A.L., Burt, D.W.: Normalized long read RNA sequencing in chicken reveals transcriptome complexity similar to human. BMC Genomics 18(1), 323 (2017)CrossRefGoogle Scholar
  6. 6.
    Hoang, N.V., et al.: A survey of the complex transcriptome from the highly polyploid sugarcane genome using full-length isoform sequencing and de novo assembly from short read sequencing. BMC Genomics 18(1), 395 (2017)CrossRefGoogle Scholar
  7. 7.
    Gordon, S.P., et al.: Widespread polycistronic transcripts in fungi revealed by single-molecule mRNA sequencing. PloS One 10(7), e0132628 (2015)CrossRefGoogle Scholar
  8. 8.
    Tombácz, D., et al.: Long-read isoform sequencing reveals a hidden complexity of the transcriptional landscape of herpes simplex virus type 1. Front. Microbiol. 8, 1079 (2017)CrossRefGoogle Scholar
  9. 9.
    Marchet, C., et al.: De novo clustering of long reads by gene from transcriptomics data. Nucleic Acids Res. 47(1), e1 (2018). Scholar
  10. 10.
    Workman, R.E., Myrka, A.M., Wong, G.W., Tseng, E., Welch Jr., K.C., Timp, W.: Single-molecule, full-length transcript sequencing provides insight into the extreme metabolism of the ruby-throated hummingbird Archilochus colubris. GigaScience 7(3), 1–12 (2018). Scholar
  11. 11.
    Li, J., et al.: Long read reference genome-free reconstruction of a full-length transcriptome from astragalus membranaceus reveals transcript variants involved in bioactive compound biosynthesis. Cell Discov. 3, 17031 (2017)CrossRefGoogle Scholar
  12. 12.
    Liu, X., Mei, W., Soltis, P.S., Soltis, D.E., Barbazuk, W.B.: Detecting alternatively spliced transcript isoforms from single-molecule long-read sequences without a reference genome. Mol. Ecol. Resour. 17(6), 1243–1256 (2017)CrossRefGoogle Scholar
  13. 13.
    Edgar, R.C.: Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19), 2460–2461 (2010)CrossRefGoogle Scholar
  14. 14.
    Li, W., Godzik, A.: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)CrossRefGoogle Scholar
  15. 15.
    James, B.T., Luczak, B.B., Girgis, H.Z.: MeShClust: an intelligent tool for clustering DNA sequences. Nucleic Acids Res. 46(14), e83 (2018)CrossRefGoogle Scholar
  16. 16.
    Ghodsi, M., Liu, B., Pop, M.: DNACLUST: accurate and efficient clustering of phylogenetic marker genes. BMC Bioinform. 12(1), 271 (2011)CrossRefGoogle Scholar
  17. 17.
    Paccanaro, A., Casbon, J.A., Saqi, M.A.: Spectral clustering of protein sequences. Nucleic Acids Res. 34(5), 1571–1580 (2006)CrossRefGoogle Scholar
  18. 18.
    Steinegger, M., Söding, J.: Clustering huge protein sequence sets in linear time. Nat. Commun. 9(1), 2542 (2018)CrossRefGoogle Scholar
  19. 19.
    Steinegger, M., Söding, J.: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35(11), 1026–1028 (2017)Google Scholar
  20. 20.
    Zorita, E., Cusco, P., Filion, G.J.: Starcode: sequence clustering based on all-pairs search. Bioinformatics 31(12), 1913–1919 (2015)CrossRefGoogle Scholar
  21. 21.
    Bevilacqua, V., et al.: EasyCluster2: an improved tool for clustering and assembling long transcriptome reads. BMC Bioinform. 15, S7 (2014)CrossRefGoogle Scholar
  22. 22.
    Dost, B., Wu, C., Su, A., Bafna, V.: TCLUST: a fast method for clustering genome-scale expression data. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 8(3), 808–818 (2011)CrossRefGoogle Scholar
  23. 23.
    Christoffels, A., Gelder, A.V., Greyling, G., Miller, R., Hide, T., Hide, W.: STACK: sequence tag alignment and consensus knowledgebase. Nucleic Acids Res. 29(1), 234–238 (2001)CrossRefGoogle Scholar
  24. 24.
    Burke, J., Davison, D., Hide, W.: d2\_cluster: a validated method for clustering EST and full-length cDNA sequences. Genome Res. 9(11), 1135 (1999)CrossRefGoogle Scholar
  25. 25.
    Chong, Z., Ruan, J., Wu, C.I.: Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads. Bioinformatics 28(21), 2732–2737 (2012)CrossRefGoogle Scholar
  26. 26.
    Solovyov, A., Lipkin, W.I.: Centroid based clustering of high throughput sequencing reads based on n-mer counts. BMC Bioinform. 14(1), 268 (2013)CrossRefGoogle Scholar
  27. 27.
    Bao, E., Jiang, T., Kaloshian, I., Girke, T.: SEED: efficient clustering of next-generation sequences. Bioinformatics 27(18), 2502–2509 (2011)CrossRefGoogle Scholar
  28. 28.
    Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W.: CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23), 3150–3152 (2012)CrossRefGoogle Scholar
  29. 29.
    Shimizu, K., Tsuda, K.: SlideSort: all pairs similarity search for short reads. Bioinformatics 27(4), 464–470 (2010)CrossRefGoogle Scholar
  30. 30.
    Comin, M., Leoni, A., Schimd, M.: Clustering of reads with alignment-free measures and quality values. Algorithms Mol. Biol. 10(1), 4 (2015)CrossRefGoogle Scholar
  31. 31.
    Alanko, J., Cunial, F., Belazzougui, D., Mäkinen, V.: A framework for space-efficient read clustering in metagenomic samples. BMC Bioinform. 18(3), 59 (2017)CrossRefGoogle Scholar
  32. 32.
    Orabi, B., et al.: Alignment-free clustering of UMI tagged DNA molecules. Bioinformatics (2018).
  33. 33.
    Ondov, B.D., et al.: Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17(1), 132 (2016)CrossRefGoogle Scholar
  34. 34.
    Davidson, N.M., Oshlack, A.: Corset: enabling differential gene expression analysis for de novoassembled transcriptomes. Genome Biol. 15(7), 410 (2014). Scholar
  35. 35.
    Malik, L., Almodaresi, F., Patro, R.: Grouper: graph-based clustering and annotation for improved de novo transcriptome analysis. Bioinformatics 34(19), 3265–3272 (2018)CrossRefGoogle Scholar
  36. 36.
    Krishnakumar, R., et al.: Systematic and stochastic influences on the performance of the MinION nanopore sequencer across a range of nucleotide bias. Sci. Rep. 8(1), 3159 (2018)CrossRefGoogle Scholar
  37. 37.
    Tseng, E.: Cogent: coding genome reconstruction using Iso-Seq data (2018).
  38. 38.
    Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004)CrossRefGoogle Scholar
  39. 39.
    Au, K.F., Underwood, J.G., Lee, L., Wong, W.H.: Improving PacBio long read accuracy by short read alignment. PloS one 7(10), e46679 (2012)CrossRefGoogle Scholar
  40. 40.
    Li, H.: Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018)CrossRefGoogle Scholar
  41. 41.
    Daily, J.: Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinform. 17(1), 81 (2016)MathSciNetCrossRefGoogle Scholar
  42. 42.
    Sahlin, K., Medvedev, P.: Exprimental details appendix to “de novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm” (2019).
  43. 43.
    Stöcker, B.K., Köster, J., Rahmann, S.: SimLoRD: simulation of long read data. Bioinformatics 32(17), 2704–2706 (2016)CrossRefGoogle Scholar
  44. 44.
  45. 45.
    Direct RNA and cDNA sequencing of a human transcriptome on Oxford Nanopore MinION and GridION. Accessed 24 Oct 2018
  46. 46.
    Korlach, J., et al.: De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads. GigaScience 6(10) (2017).
  47. 47.
    Rosenberg, A., Hirschberg, J.: V-measure: a conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (2007)Google Scholar
  48. 48.
    Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)CrossRefGoogle Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. 1.Department of Computer Science and EngineeringPennsylvania State UniversityState CollegeUSA
  2. 2.Department of Biochemistry and Molecular BiologyPennsylvania State UniversityState CollegeUSA
  3. 3.Center for Computational Biology and BioinformaticsPennsylvania State UniversityState CollegeUSA

Personalised recommendations