Vector sequence contamination of the Plasmodium vivax sequence database in PlasmoDB and In silico correction of 26 parasite sequences

Tao, Zhi-Yong; Sui, Xu; Jun, Cao; Culleton, Richard; Fang, Qiang; Xia, Hui; Gao, Qi

doi:10.1186/s13071-015-0927-x

Letter to the Editor
Open access
Published: 12 June 2015

Parasites & Vectors, volume 8, article number 318 (2015)

Zhi-Yong Tao¹, Xu Sui²,³,⁴, Cao Jun²,³,⁴, Richard Culleton⁵, Qiang Fang¹, Hui Xia¹ & Qi Gao²,³,⁴

Abstract

We found a 47 aa protein sequence that occurs 17 times in the Plasmodium vivax nucleotide database published on PlasmoDB. Coding sequence analysis showed multiple restriction enzyme sites within the 141 bp nucleotide sequence and a His6 tag attached to the 3′ end, suggesting that the sequence originates from a cloning vector. Sequences with vector contamination were submitted to NCBI, and BLASTN was used to cross-examine whole-genome shotgun (WGS) contigs from four recently deposited P. vivax whole genome sequencing projects. At least 26 genes listed in the PlasmoDB database incorporate this cloning vector sequence into their predicted provisional protein products.

Findings

Genome databases are of great value for biomedical research and have significantly advanced our understanding of the biology of multiple parasite species, including Plasmodium falciparum and Plasmodium vivax, the two most common malaria parasites [1, 2]. The P. vivax genome sequence was produced by shotgun sequencing at fivefold coverage by Carlton et al. at TIGR in 2008, and is deposited in GenBank and PlasmoDB [3]. Assembly errors are inevitable when constructing genomes and, in the case of intracellular parasites, contamination with host DNA sequence also poses a problem. Indeed, recent research has shown that many published genomes, including mammalian genomes, contain contaminating sequence from a variety of microorganisms [4]. With regard to gene prediction errors in malaria parasites specifically, Lu et al. reported that about 20% of genes are incorrectly predicted in the P. falciparum genome database, although these errors mostly arise from the gene prediction software used [5].

During a search for repetitive protein fragments in the P. vivax genome, conducted on the nucleotide sequences deposited in PlasmoDB [6], we found a 47 amino acid (aa) sequence (KGQDNSADIQHSGGRSSLEGPRFEGKPIPNPLLGLDSTRTGHHHHHH) repeated a total of 17 times in several annotated contigs. A His6 tag (Fig 1A) was attached to the 3′ end, and multiple restriction enzyme sites (Fig 1B) were present within the 141 bp nucleotide sequence (AAG GGT CAA GAC AAT TCT GCA GAT ATC CAG CAC AGT GGC GGC CGC TCG AGT CTA GAG GGC CCG CGG TTC GAA GGT AAG CCT ATC CCT AAC CCT CTC CTC GGT CTC GAT TCT ACG CGT ACC GGT CAT CAT CAC CAT CAC CAT). When run through a VecScreen search (NCBI, http://www.ncbi.nlm.nih.gov/tools/vecscreen/), this sequence shows significant similarity to the promoter probe vector pMQ354 (Fig 1C). These features suggest cloning vector sequence contamination.

We performed BLASTN searches of these 17 coding sequences against whole-genome shotgun (WGS) contigs of four whole genome sequences (India VII [GenBank: AFMK01000000], North Korean [GenBank: AFBK01000000], Brazil I [GenBank: AFNI01000000], Mauritania I [GenBank: AFNJ01000000]) [7]. All hits were aligned with the reference sequence. The alignments showed missing or substituted base pairs at the 3′ end of the query sequences, which removed the correct stop codon of the parasite gene and incorporated the vector sequence into the predicted protein product, which then terminated at the vector stop codon. To allow for possible frame shifts, we translated the coding sequence in all three frames (Fig 1A) and used the frame two and frame three translations as query sequences against the PlasmoDB protein database. This produced five and four sequence hits, respectively, and these nine sequences were aligned and corrected as described above. In total, we discovered 26 sequences in PlasmoDB contaminated by the vector sequence (Table 1).
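The relationship between the 141 bp nucleotide sequence and the 47 aa peptide can be checked with a short script. The following is a minimal Python sketch, not the authors' pipeline: it translates the sequence with the standard genetic code, produces the three forward-frame translations, and scans for restriction enzyme recognition sites. The enzyme panel is an assumption chosen for illustration (the enzymes shown in Fig 1B are not named in the text), and the function names are hypothetical.

```python
# Sequences copied from the text; spaces between codons are removed.
VECTOR_DNA = (
    "AAG GGT CAA GAC AAT TCT GCA GAT ATC CAG CAC AGT GGC GGC CGC TCG "
    "AGT CTA GAG GGC CCG CGG TTC GAA GGT AAG CCT ATC CCT AAC CCT CTC "
    "CTC GGT CTC GAT TCT ACG CGT ACC GGT CAT CAT CAC CAT CAC CAT"
).replace(" ", "")
VECTOR_PEPTIDE = "KGQDNSADIQHSGGRSSLEGPRFEGKPIPNPLLGLDSTRTGHHHHHH"

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Illustrative panel of common restriction enzymes (an assumption; the
# enzymes in the paper's Fig 1B are not listed in the text).
ENZYMES = {
    "PstI": "CTGCAG", "EcoRV": "GATATC", "NotI": "GCGGCCGC",
    "XhoI": "CTCGAG", "XbaI": "TCTAGA", "ApaI": "GGGCCC",
    "SacII": "CCGCGG", "BstBI": "TTCGAA", "MluI": "ACGCGT",
    "AgeI": "ACCGGT",
}


def translate(dna, frame=0):
    """Translate dna from offset frame (0, 1 or 2), dropping any partial codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def three_frames(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]


def find_sites(dna, enzymes=ENZYMES):
    """Map each enzyme to the 0-based start positions of its site in dna."""
    hits = {}
    for name, site in enzymes.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start)
            start = dna.find(site, start + 1)
        if positions:
            hits[name] = positions
    return hits


if __name__ == "__main__":
    print(len(VECTOR_DNA), "bp")
    print("frame 1 matches the 47 aa peptide:",
          translate(VECTOR_DNA) == VECTOR_PEPTIDE)
    for number, protein in enumerate(three_frames(VECTOR_DNA), start=1):
        print(f"frame {number}: {protein}")
    for name, positions in find_sites(VECTOR_DNA).items():
        print(name, positions)
```

Exact string matching is enough here because all sites in the illustrative panel are unambiguous palindromes; enzymes with degenerate recognition sequences would need pattern matching instead.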

Table 1 Correction of 26 genes affected by a contaminated cloning vector sequence in PlasmoDB

Cloning vector sequences are generally easy to recognize with tools such as VecScreen. The P. vivax database has been updated more than ten times [8], and yet this vector sequence contamination persists, suggesting that it may have special characteristics that make it difficult to identify automatically. Attempted PCR amplification of Sal-1 genomic DNA using primers specific for the potential contaminating sequence would provide definitive proof of whether these sequences really are present in the genome, a scenario we believe to be highly unlikely.
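As a rough illustration of how such contamination could be flagged in predicted proteins, the following Python sketch checks whether a protein ends with the 47 aa vector-derived peptide, or with a C-terminal portion of it, and reports where the tail begins. This is a simplified exact-match screen under stated assumptions, not VecScreen and not the procedure used in this study: the function name, the `min_len` threshold, and the toy records are all hypothetical, and a real screen would need to tolerate the substitutions and frame shifts described above.

```python
VECTOR_PEPTIDE = "KGQDNSADIQHSGGRSSLEGPRFEGKPIPNPLLGLDSTRTGHHHHHH"


def vector_tail_start(protein, vector=VECTOR_PEPTIDE, min_len=10):
    """Return the index where a C-terminal vector-derived tail begins.

    The protein must end with a suffix of the vector peptide that is at
    least min_len residues long; otherwise None is returned. Longer
    tails are tried first, so the earliest possible start is reported.
    """
    for i in range(len(vector) - min_len + 1):
        tail = vector[i:]
        if protein.endswith(tail):
            return len(protein) - len(tail)
    return None


if __name__ == "__main__":
    # Toy records invented for illustration; these are not PlasmoDB entries.
    records = {
        "toy_full_tail": "MKTAYIAKQR" + VECTOR_PEPTIDE,
        "toy_partial_tail": "MSLNEELVKK" + VECTOR_PEPTIDE[20:],
        "toy_his_only": "MGGSHHHHHH",
    }
    for name, protein in records.items():
        start = vector_tail_start(protein)
        if start is None:
            print(name, "no vector-derived tail found")
        else:
            print(name, "tail starts at residue", start + 1,
                  "trimmed:", protein[:start])
```

The `min_len` threshold keeps a short His-rich C-terminus from being flagged on its own; the value of 10 is arbitrary and would need tuning against real data.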

The publication of four geographical reference strain whole genome sequences now provides an opportunity to correct the Sal-1 reference genome sequence. Given our findings, closer interrogation of the P. vivax genome deposited in PlasmoDB may reveal additional contamination. It is also possible that previous work that made use of these sequences will require reappraisal.

References

1. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, et al. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002;419:498–511.
2. Carlton JM, Adams JH, Silva JC, Bidwell SL, Lorenzi H, Caler E, et al. Comparative genomics of the neglected human malaria parasite Plasmodium vivax. Nature. 2008;455(7214):757–63.
3. Carlton J. The Plasmodium vivax genome sequencing project. Trends Parasitol. 2003;19(5):227–31.
4. Merchant S, Wood DE, Salzberg SL. Unexpected cross-species contamination in genome sequencing projects. PeerJ. 2014;2:e675.
5. Lu F, Jiang H, Ding J, Mu J, Valenzuela JG, Ribeiro JM, et al. cDNA sequences reveal considerable gene prediction inaccuracy in the Plasmodium falciparum genome. BMC Genomics. 2007;8:255.
6. Tao ZY, Xu S, Wang YY, Fang Q, Xia H, Gao Q. Plasmodium vivax specific peptides prediction and screening based on repetitive protein sequences and linear B cell epitope. Zhongguo Xue Xi Chong Bing Fang Zhi Za Zhi. 2014;26(3):292–5, 310. [Article in Chinese].
7. Neafsey DE, Galinsky K, Jiang RH, Young L, Sykes SM, Saif S, et al. The malaria parasite Plasmodium vivax exhibits greater genetic diversity than Plasmodium falciparum. Nat Genet. 2012;44(9):1046–50.
8. Bahl A, Brunk B, Crabtree J, Fraunholz MJ, Gajria B, Grant GR, et al. PlasmoDB: the Plasmodium genome resource. A database integrating experimental and computational data. Nucleic Acids Res. 2003;31(1):212–5.

Acknowledgements

We thank Dr. Lu Feng from JIPD for providing valuable advice, and the peer reviewers for their insightful and constructive comments. This work was supported by grants from the National S & T Major Program (Grant No. 2012ZX10004220), the Open Programme of Key Laboratory on Technology for Parasitic Disease Prevention and Control of Chinese Ministry of Health (No. WK014-003), the Anhui Provincial Natural Science Foundation (No. 1308085MH160), the Key Program of Bengbu Medical College Science & Technology Development Fund (No. Bykf13A09) and Natural Science Fund (No. BYKY1402ZD). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations

Department of Parasitology, Bengbu Medical College, Bengbu, People’s Republic of China
Zhi-Yong Tao, Qiang Fang & Hui Xia
Jiangsu Institute of Parasitic Diseases, Wuxi, China
Xu Sui, Cao Jun & Qi Gao
Key Laboratory of Parasitic Disease Control and Prevention, Ministry of Health, Wuxi, China
Xu Sui, Cao Jun & Qi Gao
Jiangsu Provincial Key Laboratory of Parasite Molecular Biology, Wuxi, China
Xu Sui, Cao Jun & Qi Gao
Malaria Unit, Institute of Tropical Medicine, Nagasaki University, Sakamoto, Nagasaki, Japan
Richard Culleton

Corresponding authors

Correspondence to Hui Xia or Qi Gao.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contribution

ZYT, HX and QG conceived the study and participated in its design and coordination. ZYT, SX and QF carried out sequence comparison and correction. ZYT and RC wrote the manuscript. All authors read and approved the final manuscript.

About this article

Cite this article

Tao, ZY., Sui, X., Jun, C. et al. Vector sequence contamination of the Plasmodium vivax sequence database in PlasmoDB and In silico correction of 26 parasite sequences. Parasites Vectors 8, 318 (2015). https://doi.org/10.1186/s13071-015-0927-x

Received: 22 April 2015
Accepted: 02 June 2015
Published: 12 June 2015
DOI: https://doi.org/10.1186/s13071-015-0927-x
