A Parallel Expressed Sequence Tag (EST) Clustering Program
This paper describes the UIcluster software tool, which partitions Expressed Sequence Tag (EST) sequences and other genetic sequences into “clusters” based on sequence similarity. Ideally, each cluster contains only sequences that represent the same gene. A naïve approach, such as an N×N all-against-all comparison (where N is the number of input sequences), is feasible only for very small data sets. UIcluster has been developed over the course of four years to solve this problem efficiently and accurately for large data sets consisting of tens or hundreds of thousands of EST sequences. The latest version of the application has been parallelized using the Message Passing Interface (MPI) standard. Both the computation and the memory requirements of the program can be distributed among multiple (possibly distributed) UNIX processes.
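The abstract does not spell out how UIcluster avoids the N×N comparison. As an illustration only, the sketch below shows one common way to do so: greedy incremental clustering, in which each new sequence is compared against a hash index of words drawn from existing cluster representatives rather than against every other sequence. The function name `greedy_cluster`, the word length `k`, and the threshold `min_shared` are hypothetical and are not taken from the paper; the similarity test (a count of shared words) is a deliberately simplified stand-in for a real alignment-based criterion.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring (word) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def greedy_cluster(seqs, k=8, min_shared=3):
    """Assign each sequence to a cluster, or start a new one.

    Assumption (not from the paper): the first member of a cluster acts
    as its representative, and a sequence joins the cluster whose
    representative shares the most distinct k-mers with it, provided
    that count is at least min_shared.
    """
    index = defaultdict(set)  # k-mer -> ids of clusters whose representative contains it
    clusters = []             # clusters[c] = indices into seqs; first entry is the representative

    for i, seq in enumerate(seqs):
        words = set(kmers(seq, k))

        # Count shared k-mers per candidate cluster via the hash index,
        # so only clusters with at least one common word are examined.
        hits = defaultdict(int)
        for w in words:
            for c in index.get(w, ()):
                hits[c] += 1

        best = max(hits, key=hits.get, default=None)
        if best is not None and hits[best] >= min_shared:
            clusters[best].append(i)
        else:
            # No sufficiently similar representative: open a new cluster
            # and index the words of its representative.
            c = len(clusters)
            clusters.append([i])
            for w in words:
                index[w].add(c)

    return clusters
```

Under these assumptions, each sequence is looked up once in the index, so the work grows with the number of sequences and candidate clusters actually hit rather than with every pair of sequences. Because each cluster is described only by its representative's words, the index could in principle be partitioned across processes, which is one plausible reading of how both computation and memory might be distributed.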
Keywords: Similarity Criterion; Cluster Program; Cluster Representative; International Human Genome Sequencing Consortium; Hash Length