Protein Sequence Motif Discovery on Distributed Supercomputer
The motif discovery problem has gained considerable significance in biological science over the past decade. Recently, various approaches have been used successfully to discover motifs; some are probabilistic and others are combinatorial. We follow a graph-based approach to this problem, built on the idea of de Bruijn graphs. The de Bruijn graph has been adopted successfully in the past to solve problems such as local multiple alignment and DNA fragment assembly. The proposed algorithm harnesses the power of the de Bruijn graph to discover conserved regions, such as motifs, in protein sequences. The sequential algorithm achieves a 70% motif match with MEME and a 65% pattern match with the Gibbs motif sampler. The motif discovery problem is data intensive, requires substantial computational resources, and cannot be solved on a single system. In this paper, we use the distributed supercomputers available on the Western Canada Research Grid (WestGrid) to implement the distributed graph-based approach to the motif discovery problem and analyze its performance. We use the available resources efficiently to distribute data among the multicore nodes of the machine, and we redesign the algorithm to suit the architecture. We show that a pure distributed implementation is not efficient for this problem. We therefore develop a hybrid algorithm that uses fine-grained parallelism within the nodes and coarse-grained parallelism across the nodes. Experiments show that this hybrid algorithm runs three times faster than the pure distributed-memory implementation.
Keywords: Parallel Algorithm, Motif Discovery, Adjacency List, Graph Traversal, Motif Discovery Algorithm
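To make the graph-based idea in the abstract concrete, the following is a minimal sketch of a de Bruijn graph over protein sequences stored as an adjacency list. It is an illustration only, not the algorithm of the paper: the function names `build_de_bruijn` and `conserved_edges`, the choice of (k-1)-mers as nodes and k-mers as edges, the `min_support` threshold, and the toy sequences are all assumptions made for this example.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Build an adjacency list: (k-1)-mer -> {next (k-1)-mer: set of sequence indices}.

    Each k-mer in a sequence contributes one edge from its prefix to its suffix.
    Recording which sequences use an edge is an assumption of this sketch.
    """
    graph = defaultdict(lambda: defaultdict(set))
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]][kmer[1:]].add(idx)
    return graph


def conserved_edges(graph, min_support):
    """Return {k-mer: support} for edges shared by at least min_support sequences."""
    result = {}
    for src, neighbours in graph.items():
        for dst, seq_ids in neighbours.items():
            if len(seq_ids) >= min_support:
                # Rebuild the k-mer from its prefix node and the last residue of its suffix node.
                result[src + dst[-1]] = len(seq_ids)
    return result


# Hypothetical toy input: three short sequences sharing the fragment "TAYI".
seqs = ["MKTAYIAK", "GGTAYIPL", "TAYIQQ"]
graph = build_de_bruijn(seqs, k=4)
print(conserved_edges(graph, min_support=3))
```

Edges traversed by many sequences mark candidate conserved regions. In a hybrid setting such as the one the abstract describes, one plausible split would be to partition the sequences across nodes and merge the per-node adjacency lists, but that decomposition is an assumption here rather than something the abstract specifies.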