Features of Arabidopsis Genes and Genome Discovered using Full-length cDNAs
Received: 26 August 2004; Accepted: 29 August 2005
Cite this article as: Alexandrov, N.N., Troukhan, M.E., Brover, V.V. et al. Plant Mol Biol (2006) 60: 69. doi:10.1007/s11103-005-2564-9

Abstract

Arabidopsis is currently the reference genome for higher plants. A new, more detailed statistical analysis of Arabidopsis gene structure is presented including intron and exon lengths, intergenic distances, features of promoters, and variant 5′-ends of mRNAs transcribed from the same transcription unit. We also provide a statistical characterization of Arabidopsis transcripts in terms of their size, UTR lengths, 3′-end cleavage sites, splicing variants, and coding potential. These analyses were facilitated by scrutiny of our collection of sequenced full-length cDNAs and much larger collection of 5′-ESTs, together with another set of full-length cDNAs from Salk/Stanford/Plant Gene Expression Center/RIKEN. Examples of alternative splicing are observed for transcripts from 7% of the genes and many of these genes display multiple spliced isoforms. Most splicing variants lie in non-coding regions of the transcripts. Non-canonical splice sites constitute less than 1% of all splice sites. Genes with fewer than four introns display reduced average mRNA levels. Putative alternative transcription start sites were observed in 30% of highly expressed genes and in more than 50% of the genes with low expression. Transcription start sites correlate remarkably well with a CG skew peak in the DNA sequences. The intergenic distances vary considerably, those where genes are transcribed towards one another being significantly shorter. New transcripts, missing in the current TIGR genome annotation and ESTs that are non-coding, including those antisense to known genes, are derived and cataloged in the Supplementary Material. They identify 148 new loci in the Arabidopsis genome. The conclusions drawn provide a better understanding of the Arabidopsis genome and how the gene transcripts are processed.
The results also allow better predictions to be made for, as yet, poorly defined genes and provide a reference for comparisons with other plant genomes whose complete sequences are currently being determined. Some comparisons with rice are included in this paper.

Keywords: Alternative splicing; Arabidopsis genome; Full-length cDNA; Gene prediction; Genome statistics

Electronic supplementary material to this article is available at http://dx.doi.org/10.1007/s11103-005-2564-9 and is accessible for authorized users.
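
The abstract reports that transcription start sites coincide with a peak in CG skew. As a rough illustration only, the sketch below computes a sliding-window skew profile and locates its maximum. It assumes the commonly used definition, skew = (C − G) / (C + G), which the abstract itself does not spell out. The function names `cg_skew` and `skew_peak` and the `window` and `step` parameters are hypothetical choices for this example, not the authors' method.

```python
def cg_skew(seq, window=50, step=10):
    """Return (window_start, skew) pairs for a DNA sequence.

    Assumed definition: skew = (C - G) / (C + G) within each window.
    Windows containing neither C nor G are given a skew of 0.0.
    """
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        c = chunk.count("C")
        g = chunk.count("G")
        skew = (c - g) / (c + g) if (c + g) else 0.0
        profile.append((start, skew))
    return profile


def skew_peak(profile):
    """Return the (window_start, skew) pair with the largest skew."""
    return max(profile, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy sequence: a G-rich stretch followed by a C-rich stretch.
    toy = "GGGGATATCCCCATAT"
    profile = cg_skew(toy, window=4, step=4)
    print(profile)
    print(skew_peak(profile))
```

In practice one would presumably run such a profile over genomic sequence aligned on annotated transcription start sites and average across genes; the window and step sizes here are arbitrary and would need tuning to the data.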