Abstract
Unraveling the mechanisms that regulate gene expression is a major challenge in biology. An important task within this challenge is to identify regulatory elements, especially the binding sites in deoxyribonucleic acid (DNA) for transcription factors. These binding sites are short, recurring DNA segments, called motifs, that are presumed to have a biological function. Motif discovery has been one of the most widely studied problems in bioinformatics ever since genomic sequences became available. Recent advances in genome sequence availability and in high-throughput gene expression analysis technologies have enabled the development of computational methods for motif discovery, and a large number of motif finding algorithms have been implemented and applied to various motif models over the past decade. Because regulatory elements are frequently short and variable, identifying them computationally is difficult, and detecting them in a large set of regulatory regions remains a challenging problem in computational genomics. Significant advances have nevertheless been made in the computational modeling and detection of DNA regulatory elements, but the methods that extract this biologically meaningful information have high computational requirements. High performance computing is a promising way to meet this challenge. Designing a parallel algorithm that detects regulatory elements using correlation with gene expression data, and implementing it with Open MPI and OpenMP, leads to significant runtime savings on a distributed system. Solving computationally intensive problems on a high performance computing architecture can significantly reduce the run time of the solution when proper task distribution, a suitable scheduling strategy and suitable parallel computing paradigms are used. Deploying more and more cluster computers can bridge the gap of speed difference between architectures and will result in fewer concurrent jobs that can be allocated to the system.
1 State-of-the-art
Motif finding is a well-studied problem in computing. Various motif search algorithms have been developed, falling into two main categories: heuristic and exact. Heuristic algorithms perform an iterative local search, for instance by repeatedly refining an input sampling or projection until a motif is found. Gibbs sampling and Expectation Maximization (EM), used in the motif-finding tool MEME, both use probabilistic computations to optimize an initial random alignment. [An alignment is simply a vector (a1, a2,…,an) of n positions, which predicts that the motif occurs at position ai in the given sequence Si.] Gibbs sampling tries to refine the alignment one position at a time; in contrast, EM may recompute the entire alignment in a single iteration. Projection combines a pattern-based approach with EM’s probabilistic approach, trying to guess every successive character of a tentative motif and using EM to verify its guesses. GARPS uses a random version of projection, in tandem with a genetic algorithm (GA), for yet another iterative approach. These are just some of many successful heuristic algorithms [1, 2]. However, heuristics are non-exhaustive and thus are not guaranteed to find a solution. Exact algorithms, on the other hand, perform an exhaustive search of possible motifs and so always find the planted motif.
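To make the notion of an alignment concrete, the following minimal Python sketch extracts the l-mers that an alignment vector predicts and derives a consensus string from them. The function names, the 0-based positions, the toy sequences and the alphabetical tie-breaking are assumptions made for illustration; they are not taken from MEME or any other tool mentioned above.

```python
from collections import Counter

def lmers_from_alignment(sequences, alignment, l):
    """Extract the l-mer that the alignment predicts in each sequence.

    alignment[i] is the (0-based, assumed) start position a_i in sequence S_i.
    """
    return [seq[a:a + l] for seq, a in zip(sequences, alignment)]

def consensus(lmers):
    """Column-wise majority vote; ties are broken alphabetically (an assumption)."""
    result = []
    for column in zip(*lmers):
        counts = Counter(column)
        # sorted() fixes the iteration order, so max() picks the
        # alphabetically first character among equally frequent ones.
        result.append(max(sorted(counts), key=lambda ch: counts[ch]))
    return "".join(result)

# Hypothetical toy input: three sequences and one start position per sequence.
sequences = ["TTACGTAA", "GACGTCCC", "CCCACGAA"]
alignment = [2, 1, 3]
picked = lmers_from_alignment(sequences, alignment, 4)
print(picked, consensus(picked))
```

A heuristic such as Gibbs sampling or EM would repeatedly adjust the entries of `alignment` to improve a score computed from the extracted l-mers; the sketch only shows the data structure being optimized, not the optimization itself.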
WINNOWER and its successor MITRA are exact algorithms that look at pairwise l-mer similarity to find motifs. In a set of DNA sequences, there are numerous pairs of “similar” l-mers, which come from different sequences and have Hamming distances of at most 2d from each other (meaning that they could be two d neighbors of the same l-mer). WINNOWER represents these pairs in a graph, with l-mers as nodes and edges connecting l-mer pairs. It then prunes the graph to identify “cliques” of pairs that indicate a motif. MITRA refines this graph representation into a mismatch tree containing all possible l-mers, organized by prefix [3, 4]. The tree structure allows MITRA to eliminate entire branches at a time, making it faster than WINNOWER at removing the spurious edges that are not part of any motif clique.
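The pairwise similarity criterion described above can be sketched in a few lines of Python. This is only an illustration of how the edges of such a graph could be built, under the assumption that a node is a (sequence index, start position) pair; it does not reproduce WINNOWER's pruning or MITRA's mismatch tree, and all names are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equally long strings differ."""
    return sum(x != y for x, y in zip(a, b))

def similar_pairs(sequences, l, d):
    """Edges between l-mers of different sequences with Hamming distance <= 2d.

    Each node is a (sequence index, start position) pair.
    """
    nodes = [(i, p)
             for i, seq in enumerate(sequences)
             for p in range(len(seq) - l + 1)]
    edges = []
    for (i, p), (j, q) in combinations(nodes, 2):
        if i == j:
            continue  # only l-mers from different sequences are paired
        if hamming(sequences[i][p:p + l], sequences[j][q:q + l]) <= 2 * d:
            edges.append(((i, p), (j, q)))
    return edges

# Hypothetical toy input with l = 4 and d = 1.
edges = similar_pairs(["ACGTT", "ACCTA"], l=4, d=1)
print(edges)
```

The quadratic number of node pairs is what makes this representation expensive on realistic inputs, which is the cost that the pruning strategies described above aim to reduce.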
The current state-of-the-art in exact motif search is qPMS9, the most recent in a series of Planted Motif Search algorithms. It performs a sample-driven step, which generates a k-tuple of l-mers from each of k input strings, followed by a pattern-driven step, which generates the common d-neighborhood of the tuple and then checks whether any of the l-mers in this common neighborhood is a motif. To identify neighbors, qPMS9 efficiently traverses the tree of all possible l-mers, using certain pruning criteria explored by predecessors PMSPrune and qPMS7 to quickly discard non-neighbor branches. Sampling in qPMS9 is an improvement on its predecessor PMS8; in building a k-tuple, qPMS9 intelligently prioritizes l-mers that have fewer matches with the l-mers already selected, such that the common d-neighborhood becomes smaller and thus faster to check through. Finally, PMS8 and qPMS9 have been implemented to run on multiple processors, allowing them to solve instances with (l, d) as large as (50, 21) in a few hours.
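The pattern-driven step can be illustrated with a brute-force Python sketch that enumerates the d-neighborhood of each l-mer in a tuple, intersects the results, and checks each surviving candidate against the input strings. This is deliberately naive: it is not qPMS9's pruned tree traversal, it is practical only for small (l, d), and the function names are assumptions made for illustration.

```python
from itertools import combinations, product

ALPHABET = "ACGT"

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def neighborhood(lmer, d):
    """All strings of the same length within Hamming distance d of lmer."""
    result = set()
    for k in range(d + 1):
        for positions in combinations(range(len(lmer)), k):
            for letters in product(ALPHABET, repeat=k):
                candidate = list(lmer)
                for pos, ch in zip(positions, letters):
                    candidate[pos] = ch
                result.add("".join(candidate))
    return result

def common_neighborhood(lmers, d):
    """Strings that lie within distance d of every l-mer in the tuple."""
    return set.intersection(*(neighborhood(x, d) for x in lmers))

def is_motif(candidate, sequences, d):
    """True if every sequence contains an l-mer within distance d of candidate."""
    l = len(candidate)
    return all(
        any(hamming(candidate, seq[p:p + l]) <= d
            for p in range(len(seq) - l + 1))
        for seq in sequences
    )

# Hypothetical toy input: a 2-tuple of l-mers taken from two short strings.
sequences = ["ACGTT", "ACCTA"]
common = common_neighborhood(("ACGT", "ACCT"), d=1)
motifs = sorted(c for c in common if is_motif(c, sequences, d=1))
print(motifs)
```

The example also shows why l-mers with fewer matches are attractive when building the tuple: each additional l-mer that disagrees with the ones already chosen shrinks the intersection that has to be checked.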
2 Work done till now
We have proposed and implemented a distributed parallel computing algorithm for the motif discovery problem: a simple, scalable and efficient parallel implementation of Planted Motif Search using OpenMP and Open MPI on a cluster computer. We have also presented a method for creating a Beowulf cluster [5, 6]. The efficiency of the algorithm is validated by testing it on both simulated and real biological data sets.
2.1 Experimental result on simulated data set
The input sequences are simulated data sets with parameters t = 20 sequences and m = 600 characters, where the characters are A, C, G, T. Each (l, d) input instance is generated as follows: we generate random strings of length (m − l) each, in which the characters appear randomly with equal probability [7, 8]. We then randomly generate a string M of length l and plant a copy of it in each sequence at a random position, after mutating it with at most d random mutations (Fig. 1; Table 1).
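The generation procedure described above can be sketched in Python as follows. The parameters t, m, l and d follow the text; the function names, the fixed seed, the choice of drawing the number of mutated positions uniformly from 0 to d, and the example instance (l, d) = (15, 4) are assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"

def mutate(motif, d, rng):
    """Return a copy of motif with at most d positions substituted."""
    chars = list(motif)
    # Assumption: the number of mutated positions is uniform in 0..d.
    for pos in rng.sample(range(len(chars)), rng.randint(0, d)):
        chars[pos] = rng.choice(ALPHABET)
    return "".join(chars)

def planted_instance(t, m, l, d, seed=0):
    """Generate t sequences of length m, each with one planted motif variant."""
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(t):
        # Background of length m - l, characters with equal probability.
        background = "".join(rng.choice(ALPHABET) for _ in range(m - l))
        pos = rng.randint(0, m - l)  # insertion point, inclusive bounds
        variant = mutate(motif, d, rng)
        sequences.append(background[:pos] + variant + background[pos:])
        positions.append(pos)
    return motif, sequences, positions

# Hypothetical example instance with t = 20 and m = 600 as in the text.
motif, sequences, positions = planted_instance(t=20, m=600, l=15, d=4)
print(motif, len(sequences), len(sequences[0]))
```

Returning the planted positions alongside the sequences makes it straightforward to verify that a motif search recovers the planted string.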
Figure 2 shows the scalability results for our algorithm.
The proposed method reduces the running time, and the speedup achieved scales well as the number of cluster nodes increases.
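For reference, speedup and parallel efficiency are conventionally defined as below. The symbols are notation assumed here rather than taken from the figures: T_1 is the run time on a single node, T_p is the run time on p nodes, S(p) is the speedup and E(p) is the efficiency.

```latex
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}
```

Under these definitions, a speedup that scales well corresponds to S(p) growing close to linearly with p, so that E(p) stays near 1 as nodes are added.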
2.2 Experimental result on real data set
We test PMS on a set of real biological data used in the literature. This data set contains the upstream DNA regions of a set of genes from different species (Table 2).
3 Future work
Solving computationally intensive problems on a high performance computing architecture can significantly reduce the run time of the solution when proper task distribution, a suitable scheduling strategy and suitable parallel computing paradigms are used. Deploying more and more cluster computers can bridge the gap of speed difference between architectures and will result in fewer concurrent jobs that can be allocated to the system. Future work may include the use of different scheduling strategies, together with intelligent selection criteria for choosing the best scheduling strategy for a given computationally intensive problem. We believe that this paper is a step towards a complete system for solving computationally intensive problems on heterogeneous architectures.
3.1 Research papers submitted
1. “GENOME WIDE IDENTIFICATION OF CIS–REGULATORY MOTIF USING BEOWULF CLUSTER”, submitted to “IETE Journal of Research” on 6th April 2017. Status: Under Review
2. “REVIEW OF REGULATORY MOTIF DISCOVERY ALGORITHMS”, submitted to “IETE Technical Review” on 15th August 2017. Status: Under Review
References
Nicolae M, Rajasekaran S (2014) Efficient sequential and parallel algorithms for planted motif search. BMC Bioinform 15:34. https://doi.org/10.1186/1471-2105-15-34
Ikebata H, Yoshida R (2015) Repulsive parallel MCMC algorithm for discovering diverse motifs from large sequence sets. Bioinformatics. https://doi.org/10.1093/bioinformatics/btv017
Fan Y, Wu W, Liu R, Yang W (2013) An iterative algorithm for motif discovery. Procedia Comput Sci 24:25–29
Huo H, Zhao Z, Stojkovic V, Liu L (2010) Optimizing genetic algorithm for motif discovery. Math Comput Model 52:2011–2020
Kaya M (2009) MOGAMOD: multi-objective genetic algorithm for motif discovery. Expert Syst Appl 36(2):1039–1947
Huo H, Zhao Z, Stojkovic V, Liu L (2010) Optimizing genetic algorithm for motif discovery. Math Comput Model 52:2011–2020
De Witte D et al (2015) BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elements. Bioinformatics 31:3758–3766. https://doi.org/10.1093/bioinformatics/btv466
Nicolae M, Rajasekaran S (2015) qPMS9: an efficient algorithm for quorum planted motif search. Sci Rep 5:7813
Mohantyr S, Sahu B, Acharya AK (2013) Parallel implementation of exact algorithm for planted (l,d) motif search. In: Proceedings of the international conference on advances in computer science, AETACS
Liu FM, Tsai JJ, Chen RM, Chen SN, Shih SH (2004) FMGA: finding motifs by genetic algorithm. In: The fourth IEEE symposium on bioinformatics and bioengineering, p 459
Acknowledgements
Funding was provided by the Ministry of Electronics and Information Technology (Grant No. YFRA).
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Shrimankar, D.D. High performance computing approach for DNA motif discovery. CSIT 7, 295–297 (2019). https://doi.org/10.1007/s40012-019-00235-w
DOI: https://doi.org/10.1007/s40012-019-00235-w