Practical Software for Aligning ESTs to Human Genome

Ogasawara, Jun; Morishita, Shinichi

doi:10.1007/3-540-45452-7_1

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2373)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

There is a pressing need to align the growing set of expressed sequence tags (ESTs) to the newly sequenced human genome, which is still frequently revised, in order to provide biologists and medical scientists with fresh information. The problem is, however, complicated by the exon/intron structure of eukaryotic genes, misread nucleotides in ESTs, and millions of repetitive sequences in genomic sequences. To solve it, algorithms that use dynamic programming (DP) have been proposed, with space complexity O(N) and time complexity O(MN) for a genomic sequence of length M and an EST of length N, but in practice these algorithms require an enormous amount of processing time. In an effort to improve the computational efficiency of these classical DP algorithms, we develop software that fully utilizes a lookup table storing the positions at which each short subsequence occurs in the genomic sequence, allowing efficient detection of the start and end points of an EST within a given DNA sequence and, subsequently, prompt identification of exons and introns. In addition, high sensitivity and accuracy must be achieved by correctly calculating the locations of all splice sites for more ESTs while retaining high computational efficiency. This goal is hard to accomplish in practice, owing to misread nucleotides in ESTs and repetitive sequences in the genome, but we present a couple of heuristics that are effective in settling this issue. Experimental results have confirmed that our technique improves the overall computation time by orders of magnitude compared with common tools such as sim4 and BLAT, while attaining high sensitivity and accuracy on datasets of clean and documented genes. Consequently, our software is able to align about three million ESTs to a draft genome in less than one day, and all the information is available through the WWW at http://grl.gi.k.u-tokyo.ac.jp/.
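
The lookup-table idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the authors' data structure, subsequence length, or heuristics, so everything below is an assumption made for illustration: the function names `build_lookup_table` and `locate_est_ends`, the choice of k = 6, and the rule of pairing the first and last k-mers of the EST are all hypothetical. The sketch also ignores misread nucleotides and the repeat handling that the paper addresses.

```python
from collections import defaultdict


def build_lookup_table(genome, k):
    """Map each length-k subsequence of `genome` to the list of
    0-based positions at which it starts (positions are ascending)."""
    table = defaultdict(list)
    for i in range(len(genome) - k + 1):
        table[genome[i:i + k]].append(i)
    return dict(table)


def locate_est_ends(table, est, k):
    """Guess the genomic span covered by `est` from exact hits of its
    first and last k-mers (a hypothetical rule, not the paper's method).

    Returns (start, end) with `end` exclusive, choosing the shortest
    span that is at least as long as the EST, or None if no such pair
    of hits exists.
    """
    heads = table.get(est[:k], [])
    tails = table.get(est[-k:], [])
    best = None
    for h in heads:
        for t in tails:
            end = t + k
            span = end - h
            # The genomic span must be able to contain the whole EST;
            # any extra length is attributed to introns.
            if span >= len(est) and (best is None or span < best[1] - best[0]):
                best = (h, end)
    return best


# Toy data (made up): two exons separated by one intron.
exon1, intron, exon2 = "ACGTTGCA", "GTAAGTCCCCCCAG", "TTGACCGA"
genome = "GGGG" + exon1 + intron + exon2 + "TTTT"
est = exon1 + exon2  # the spliced transcript lacks the intron

table = build_lookup_table(genome, 6)
print(locate_est_ends(table, est, 6))  # prints (4, 34)
```

Building the table takes one pass over the genome, after which each k-mer query is a dictionary lookup rather than an O(MN) dynamic-programming scan. A full aligner would still need to place the exon/intron boundaries inside the located span, for example by running DP only on that much shorter region.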

Author information

Authors and Affiliations

Department of Computer Science, University of Tokyo, Japan
Jun Ogasawara & Shinichi Morishita
Department of Complexity Science and Engineering, University of Tokyo, Japan
Jun Ogasawara & Shinichi Morishita

Editor information

Editors and Affiliations

Department of Electrical Engineering and Computer Science, University of Padova, Via Gradenigo 6/A, 35131, Padova, Italy
Alberto Apostolico
Department of Informatics, Kyushu University, Fukuoka 812-8581, Japan
Masayuki Takeda

About this paper

Cite this paper

Ogasawara, J., Morishita, S. (2002). Practical Software for Aligning ESTs to Human Genome. In: Apostolico, A., Takeda, M. (eds) Combinatorial Pattern Matching. CPM 2002. Lecture Notes in Computer Science, vol 2373. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45452-7_1

DOI: https://doi.org/10.1007/3-540-45452-7_1
Published: 21 June 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43862-5
Online ISBN: 978-3-540-45452-6
eBook Packages: Springer Book Archive