CNCA aligns small annotated genomes

Lorenzi, Jean-Noël; Graner, François; Courtier-Orgogozo, Virginie; Achaz, Guillaume

doi:10.1186/s12859-024-05700-1

Software
Open access
Published: 29 February 2024

Volume 25, article number 89, (2024)

BMC Bioinformatics

Jean-Noël Lorenzi^1,2,3,
François Graner^1,4,
Virginie Courtier-Orgogozo^1,2 &
Guillaume Achaz^3

Abstract

Background

To explore the evolutionary history of sequences, a sequence alignment is a first and necessary step, and its quality is crucial. In the context of the study of the proximal origins of SARS-CoV-2 coronavirus, we wanted to construct an alignment of genomes closely related to SARS-CoV-2 using both coding and non-coding sequences. To our knowledge, there is no tool that can be used to construct this type of alignment, which motivated the creation of CNCA.

Results

CNCA is a web tool that aligns annotated genomes from GenBank files. It generates a nucleotide alignment that is then updated based on the protein sequence alignment. The final nucleotide alignment matches the protein alignment and is guaranteed to contain no frameshift. CNCA was designed to align closely related small genome sequences of up to 50 kb (typically viruses) for which the gene order is conserved.

Conclusions

CNCA constructs multiple alignments of small genomes by integrating both coding and non-coding sequences. This preserves regions traditionally ignored in conventional back-translation methods, such as non-coding regions.

Background

A naive nucleotide alignment of annotated genomes usually results in many frameshifts and other oddities that do not exist in protein alignments. Several methods have been developed to perform nucleotide alignments that take the protein alignment into account. One approach is “back-translation”, where coding nucleotide sequences are translated into amino acid sequences that are then aligned. The corresponding codons are then aligned in a final nucleotide alignment. The web-based tool web-prank (https://www.ebi.ac.uk/goldman-srv/webprank/; [1]) is one such example. Other tools based on back-translation offer specific options such as the choice of genetic code (PAL2NAL [2], transAlign [3], RevTrans [4]). Some are designed to handle cases in which frameshifts or stop codons can occur (MACSE [5, 6], PAL2NAL [2], transAlign [3]). TranslatorX [7] checks the relevance of the amino acid alignment by finding regions of uncertainty in the amino acid alignment (masked by Gblocks [8]) and reports them in the nucleotide alignment. Others are optimized for virus gene sequences (NucAmino [9], VIRULIGN [10]). To the best of our knowledge, none of these methods aligns whole genomes that contain both coding and non-coding regions. We have thus developed CNCA (Coding / Non-Coding Aligner), a genome-wide solution that returns a full genome alignment compatible with the protein sequence alignment. The method was designed for small (up to 50 kb) homologous annotated syntenic genomes devoid of introns, such as virus genomes. It will ease the subsequent evolutionary analysis of annotated genomes.
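To make the back-translation idea concrete, here is a minimal Python sketch. It is not taken from any of the tools cited above. It assumes the standard genetic code, uppercase unambiguous bases, and a protein alignment that has already been computed by some aligner; the function names and the toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame coding sequence codon by codon."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def back_translate(aligned_protein, cds):
    """Thread the codons of `cds` onto a gapped protein row ('-' marks a gap)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if len(codons) != len(aligned_protein.replace("-", "")):
        raise ValueError("protein row and coding sequence disagree in length")
    remaining = iter(codons)
    # Each amino acid column becomes one codon; each gap column becomes '---'.
    return "".join("---" if aa == "-" else next(remaining) for aa in aligned_protein)


# Toy coding sequences (made up for illustration).
cds_a = "ATGAAACTGTAA"
cds_b = "ATGCTGTAA"

# Hypothetical protein alignment of translate(cds_a) and translate(cds_b).
row_a = "MKL*"
row_b = "M-L*"

codon_row_a = back_translate(row_a, cds_a)
codon_row_b = back_translate(row_b, cds_b)
```

Because every protein gap expands to exactly three nucleotide gaps, the resulting codon alignment cannot introduce a frameshift. The sketch also shows the limitation discussed in the text: only coding sequence enters the procedure, so non-coding regions are left out.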

Implementation

CNCA is a pipeline developed in Python and R. For the alignment steps, it uses MAFFT [11]. This pipeline can be run online at https://cnca.ijm.fr/.

In addition, a standalone version is available at https://github.com/jnlorenzi/CNCA_standalone.

CNCA takes as input two or more GenBank files of annotated genomes. To cap computation time on the server, sequences submitted via the online tool must be shorter than 50 kb. It first aligns the nucleotide (nt) sequences of all genomes with MAFFT [11] and produces a Multiple Sequence Alignment (MSAnt). It then generates MSAaa, the MAFFT alignment of the concatenations of all protein sequences. As the concatenated sequence takes the protein sequences in the order of the gene annotations, synteny must be conserved. Note that an alternative pipeline would have been to align each coding region individually between genomes, but this approach was not chosen for the sake of speed and simplicity. The MSAnt is then updated using MSAaa for all coding regions where the two alignments are not concordant. A final MSAcnca is returned that contains no contradiction with MSAaa and thus no frameshift (Fig. 1A). We chose to implement a graphical web version of the pipeline to make it accessible to non-expert users. Results (logs and the three alignments MSAcnca, MSAnt, MSAaa in both nexus and fasta formats) are stored locally for a week. An email with a link to access the results is sent to the user at the end of the procedure.
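The concatenation step and its dependence on synteny can be sketched as follows. This is a minimal illustration and not the CNCA source code. It assumes that the coding regions have already been extracted from the GenBank annotations into plain records; the `CodingRegion` type, the function names, the gene labels and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodingRegion:
    gene: str     # gene label taken from the annotation (assumed available)
    start: int    # 0-based start position on the genome
    end: int      # exclusive end position on the genome
    protein: str  # annotated protein sequence


def concatenate_proteins(regions):
    """Join proteins in genomic order; also return each gene's span in the result."""
    ordered = sorted(regions, key=lambda region: region.start)
    pieces, spans, offset = [], [], 0
    for region in ordered:
        pieces.append(region.protein)
        spans.append((region.gene, offset, offset + len(region.protein)))
        offset += len(region.protein)
    return "".join(pieces), spans


def same_gene_order(genomes):
    """True if every genome lists the same genes in the same genomic order."""
    orders = [
        [region.gene for region in sorted(regions, key=lambda region: region.start)]
        for regions in genomes.values()
    ]
    return all(order == orders[0] for order in orders)


# Toy annotations (made up); the second genome is listed out of order on purpose.
genomes = {
    "genome_1": [CodingRegion("geneA", 0, 12, "MKL"), CodingRegion("geneB", 20, 29, "MS")],
    "genome_2": [CodingRegion("geneB", 18, 27, "MT"), CodingRegion("geneA", 0, 9, "ML")],
}

concatenated, spans = concatenate_proteins(genomes["genome_1"])
syntenic = same_gene_order(genomes)
```

If the gene order differed between genomes, the concatenated protein sequences would place non-homologous proteins at corresponding positions, which is why the method is restricted to syntenic genomes. The recorded spans are one possible way to map columns of the protein alignment back to individual coding regions.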

Results

As an illustration, we used CNCA on a dataset of 12 annotated genomes closely related to SARS-CoV-2. The whole pipeline runs in 45 min and generates an alignment compatible with current knowledge of coronavirus evolution. Figure 1B presents a fraction of the resulting alignment, from the end of the ORF1ab coding region to the start of the Spike coding region. The 1-bp indel present in the intergenic region between ORF1ab and Spike is detected by the CNCA approach, but not via a simple nucleotide alignment (Fig. 1C) or via a back-translation method (as it ignores non-coding regions).
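The difference between an indel in an intergenic region and a frameshift inside a coding region can be checked with a few lines of Python. The sketch below is an illustrative consistency check and not part of CNCA. It assumes that the coding regions are given as half-open column intervals of the alignment, and the rows are made up rather than taken from the coronavirus dataset.

```python
import re


def gap_runs(row):
    """Return (start, length) for every run of '-' in an aligned row."""
    return [(match.start(), len(match.group())) for match in re.finditer(r"-+", row)]


def frameshift_like_gaps(row, coding_spans):
    """Gap runs lying entirely inside a coding span whose length is not a multiple of 3."""
    flagged = []
    for start, length in gap_runs(row):
        for span_start, span_end in coding_spans:
            inside = span_start <= start and start + length <= span_end
            if inside and length % 3 != 0:
                flagged.append((start, length))
                break
    return flagged


# Toy layout: columns 0-8 and 12-20 are coding, columns 9-11 are intergenic.
coding_spans = [(0, 9), (12, 21)]

# Row from a codon-aware alignment: 1-bp intergenic gap, 3-bp coding gap.
codon_aware_row = "ATGAAATAA" + "C-T" + "ATG---TAA"

# Row from a naive nucleotide alignment: 1-bp gaps inside both coding regions.
naive_row = "ATGA-ATAA" + "CTT" + "ATGCC-TAA"

codon_aware_flags = frameshift_like_gaps(codon_aware_row, coding_spans)
naive_flags = frameshift_like_gaps(naive_row, coding_spans)
```

A gap length that is a multiple of three is a necessary condition for a frameshift-free coding alignment, not a sufficient one, since gaps must also respect codon boundaries. The check is therefore best read as a quick sanity test on an alignment rather than a proof of correctness.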

Conclusions

CNCA is a user-friendly and simple online tool. It can construct multiple alignments of small genomes by integrating both coding and non-coding sequences. We developed it for coronaviruses and it can also be used for other virus families and for short syntenic genetic loci in bacteria.

Availability and requirements

Project name: CNCA.

Project home page: https://cnca.ijm.fr/

Operating system(s): Platform independent.

Programming language: Python, R, PHP.

License: MIT.

Any restrictions to use by non-academics: none.

Availability of data and materials

Project homepage: https://cnca.ijm.fr/; Standalone version available at https://github.com/jnlorenzi/CNCA_standalone.

References

1. Löytynoja A, Goldman N. webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser. BMC Bioinformatics. 2010;11:579.
2. Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 2006;34:W609–12.
3. Bininda-Emonds OR. transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences. BMC Bioinformatics. 2005;6:156.
4. Wernersson R, Pedersen AG. RevTrans: multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res. 2003;31:3537–9.
5. Ranwez V, Douzery EJP, Cambon C, Chantret N, Delsuc F. MACSE v2: toolkit for the alignment of coding sequences accounting for frameshifts and stop codons. Mol Biol Evol. 2018;35:2582–4.
6. Ranwez V, Harispe S, Delsuc F, Douzery EJP. MACSE: multiple alignment of coding sequences accounting for frameshifts and stop codons. PLoS ONE. 2011;6:e22594.
7. Abascal F, Zardoya R, Telford MJ. TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. 2010;38:W7–13.
8. Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000;17:540–52.
9. Tzou PL, Huang X, Shafer RW. NucAmino: a nucleotide to amino acid alignment optimized for virus gene sequences. BMC Bioinformatics. 2017;18:138.
10. Libin PJK, Deforche K, Abecasis AB, Theys K. VIRULIGN: fast codon-correct alignment and annotation of viral genomes. Bioinformatics. 2019;35:1763–5.
11. Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics. 2018;34:2490–2.

Funding

This work was supported by the Labex “Who AM I?”, ANR-11-LABX-0071 and the Université Paris Cité, Idex ANR-18-IDEX-0001, funded by the French Government through its “Investments for the Future” program.

Author information

Authors and Affiliations

1. Université Paris Cité, Paris, France
Jean-Noël Lorenzi, François Graner & Virginie Courtier-Orgogozo
2. CNRS, Institut Jacques Monod, 75013, Paris, France
Jean-Noël Lorenzi & Virginie Courtier-Orgogozo
3. SMILE Group, Center for Interdisciplinary Research in Biology (CIRB), Collège de France, 75006, Paris, France
Jean-Noël Lorenzi & Guillaume Achaz
4. CNRS, Matière et Systèmes Complexes, 75013, Paris, France
François Graner

Contributions

JNL: conceptualization (equal), software (lead), writing – review & editing (equal). FG: conceptualization (equal), funding acquisition (equal), supervision (equal). VCO: conceptualization (equal), funding acquisition (equal), supervision (equal), writing – review & editing (equal). GA: conceptualization (equal), funding acquisition (equal), supervision (equal), writing – original draft (lead).

Corresponding author

Correspondence to Jean-Noël Lorenzi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Lorenzi, JN., Graner, F., Courtier-Orgogozo, V. et al. CNCA aligns small annotated genomes. BMC Bioinformatics 25, 89 (2024). https://doi.org/10.1186/s12859-024-05700-1

Received: 19 September 2023
Accepted: 12 February 2024
Published: 29 February 2024
DOI: https://doi.org/10.1186/s12859-024-05700-1