Journal of Biosciences, Volume 41, Issue 3, pp 455–474

Identifying wrong assemblies in de novo short read primary sequence assembly contigs

Article

DOI: 10.1007/s12038-016-9630-0

Cite this article as:
Chawla, V., Kumar, R. & Shankar, R. J Biosci (2016) 41: 455. doi:10.1007/s12038-016-9630-0

Abstract

With the advent of short-read-based genome sequencing approaches, a large number of organisms are being sequenced all over the world. Most of these genomes are assembled using de novo short-read assemblers and related approaches. However, the contigs produced this way are prone to mis-assembly. So far, there is a conspicuous dearth of reliable tools to identify mis-assembled contigs. Mis-assemblies could result from incorrectly deleted or wrongly arranged genomic sequences. In the present work, various factors related to sequence, sequencing and assembly have been assessed for their role in causing mis-assembly, using different genome sequencing data. Several existing mis-assembly detection tools have also been evaluated for their ability to detect wrongly assembled primary contigs, and the results suggest considerable scope for improvement in this area. The present work also proposes a simple, novel unsupervised-learning-based approach to identify mis-assemblies in contigs, which performed reasonably well when compared with the existing tools for reporting mis-assembled contigs. It was observed that the proposed methodology may work as a complementary system to the existing tools to enhance their accuracy.

Keywords

Assembly validation; clustering; contigs; de novo assembly; mis-assembly; next generation sequencing; reads

Abbreviations used

ACC: accuracy
BAC: bacterial artificial chromosome
CE: compression-expansion
FCD: fragment coverage distribution
FN: false negative
FP: false positive
MCC: Matthews correlation coefficient
NCBI: National Center for Biotechnology Information
PDBG: paired de Bruijn graphs
PE: paired end
SBS: sequencing-by-synthesis
SE: single end
SRA: sequence read archive
TN: true negative
TP: true positive
WGS: whole genome shotgun
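
The abbreviations above include ACC, MCC, TP, TN, FP and FN, which suggests that mis-assembly detection is scored with standard confusion-matrix metrics. This page does not give the formulas the authors used, so the following is only a minimal sketch based on the textbook definitions of accuracy and the Matthews correlation coefficient. The function names and the example counts are illustrative assumptions, not values from the article.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Standard accuracy: fraction of contigs classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Standard Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when any marginal sum is zero (a common convention;
    this convention is an assumption, not taken from the article).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts: 40 mis-assembled contigs flagged correctly,
    # 45 correct contigs passed, 5 false alarms, 10 missed mis-assemblies.
    print(f"ACC = {accuracy(40, 45, 5, 10):.3f}")
    print(f"MCC = {mcc(40, 45, 5, 10):.3f}")
```

MCC is often reported alongside accuracy because it stays informative when mis-assembled contigs are much rarer than correct ones, a situation in which accuracy alone can look deceptively high.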

Supplementary material

ESM 1: 12038_2016_9630_MOESM1_ESM.pdf (PDF, 977 kb)

Funding information

Funder name: CSIR
Grant number: GENESIS (BSC-121)

Copyright information

© Indian Academy of Sciences 2016

Authors and Affiliations

  1. Studio of Computational Biology & Bioinformatics, Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology, Palampur, India
  2. Department of Biotechnology, Guru Nanak Dev University, Amritsar, India
