Plant Molecular Biology, Volume 60, Issue 1, pp 69–85

Features of Arabidopsis Genes and Genome Discovered using Full-length cDNAs

  • Nickolai N. Alexandrov
  • Maxim E. Troukhan
  • Vyacheslav V. Brover
  • Tatiana Tatarinova
  • Richard B. Flavell
  • Kenneth A. Feldmann


Arabidopsis is currently the reference genome for higher plants. A new, more detailed statistical analysis of Arabidopsis gene structure is presented, including intron and exon lengths, intergenic distances, features of promoters, and variant 5′-ends of mRNAs transcribed from the same transcription unit. We also provide a statistical characterization of Arabidopsis transcripts in terms of their size, UTR lengths, 3′-end cleavage sites, splicing variants, and coding potential. These analyses were facilitated by scrutiny of our collection of sequenced full-length cDNAs and a much larger collection of 5′-ESTs, together with another set of full-length cDNAs from Salk/Stanford/Plant Gene Expression Center/RIKEN. Alternative splicing is observed for transcripts from 7% of the genes, and many of these genes display multiple spliced isoforms. Most splicing variants lie in non-coding regions of the transcripts. Non-canonical splice sites constitute less than 1% of all splice sites. Genes with fewer than four introns display reduced average mRNA levels. Putative alternative transcription start sites were observed in 30% of highly expressed genes and in more than 50% of the genes with low expression. Transcription start sites correlate remarkably well with a CG skew peak in the DNA sequences. Intergenic distances vary considerably; those between genes transcribed towards one another are significantly shorter. New transcripts missing from the current TIGR genome annotation, as well as non-coding ESTs, including those antisense to known genes, are derived and cataloged in the Supplementary Material. They identify 148 new loci in the Arabidopsis genome. The conclusions drawn provide a better understanding of the Arabidopsis genome and of how the gene transcripts are processed. The results also allow better predictions to be made for as yet poorly defined genes and provide a reference for comparisons with other plant genomes whose complete sequences are currently being determined. Some comparisons with rice are included in this paper.
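The CG skew mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact definition or window size used in the paper, so the formula (C − G)/(C + G), the sliding-window scheme, and the function names `cg_skew` and `skew_peak` below are all assumptions made for illustration only.

```python
def cg_skew(seq, window=100, step=1):
    """Return (window_start, skew) pairs for a DNA sequence.

    Assumed definition: skew = (C - G) / (C + G) within each window.
    Windows with no C or G are assigned a skew of 0.0.
    """
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        c = chunk.count("C")
        g = chunk.count("G")
        skew = (c - g) / (c + g) if (c + g) else 0.0
        profile.append((start, skew))
    return profile


def skew_peak(seq, window=100, step=1):
    """Return the (window_start, skew) pair with the highest skew, or None
    if the sequence is shorter than one window."""
    profile = cg_skew(seq, window, step)
    if not profile:
        return None
    return max(profile, key=lambda pair: pair[1])
```

Given a genomic region around an annotated gene, one could call `skew_peak(region, window=100)` and compare the position of the returned window with the transcription start site inferred from 5′-ESTs. The window size of 100 is a placeholder, not a value taken from the paper.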


Keywords: Alternative splicing; Arabidopsis genome; Full-length cDNA; Gene prediction; Genome statistics



Supplementary material (8.4 mb)
ESM file (zip 8.5 MB)



Copyright information

© Springer 2006

Authors and Affiliations

  • Nickolai N. Alexandrov (1)
  • Maxim E. Troukhan (1)
  • Vyacheslav V. Brover (1)
  • Tatiana Tatarinova (1)
  • Richard B. Flavell (1)
  • Kenneth A. Feldmann (1)

  1. Ceres Inc., Thousand Oaks, USA
