Abstract
A comprehensive understanding of transcription factor binding sites (TFBSs) is a key problem in contemporary biology and a critical issue in gene regulation. By identifying the pattern of TFBSs in a set of DNA sequences, motif discovery reveals basic regulatory relationships and helps in understanding the evolutionary system of each species. In this setting, recognizing a high-quality (ℓ, d) motif is a great challenge. This study addresses the problem in two stages, motif discovery and motif finding, using two proposed algorithms: Segmentation to Filtration (S2F) and Firefly with FREEZE (FFF), respectively. In this study, the whole DNA sequences are divided into two segments. In segment 1, which performs motif discovery, the sequences are sliced into base and sub k-mers using an iterative approach, followed by the filtration 1 and filtration 2 techniques. This approach retains the top five percent of the best motifs (TOPbk_mer), ranked by accuracy. In segment 2, the motifs recognized in segment 1 are given as input to the FFF algorithm to identify the TFBS locations. The standard firefly algorithm, combined with two freezing techniques (local and global), is employed to recognize the final motif. The performance of these algorithms is evaluated on simulated datasets and on real datasets: the Escherichia coli cyclic AMP receptor protein (CRP) dataset, the mouse Embryonic Stem Cell (mESC) dataset, and the human ChIP-seq (Chromatin Immunoprecipitation Sequencing) dataset. On all of these datasets, the experiments run within 3 min, with the number of sequences (t) ranging up to 39,601. The results show that the two proposed algorithms, S2F and FFF, can identify high-quality motifs and are faster than the state-of-the-art PMS and QPMS algorithms.
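To make the (ℓ, d) motif criterion concrete, the following minimal Python sketch checks whether a candidate string of length ℓ occurs, with at most d mismatches, in at least a fraction q of the input sequences. This is the standard quorum planted-motif definition only; it is not the S2F or FFF algorithm, and the function names and the quorum parameter `q` are illustrative assumptions rather than names taken from the article.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(candidate: str, sequence: str) -> int:
    """Smallest Hamming distance between the candidate and any window
    of the same length in the sequence."""
    l = len(candidate)
    if len(sequence) < l:
        return l + 1  # no window fits, so the candidate cannot occur
    return min(
        hamming(candidate, sequence[i:i + l])
        for i in range(len(sequence) - l + 1)
    )


def is_quorum_motif(candidate: str, sequences: List[str], d: int, q: float) -> bool:
    """True if the candidate occurs with at most d mismatches in at least
    a fraction q of the sequences (q is a hypothetical quorum parameter)."""
    hits = sum(min_distance(candidate, s) <= d for s in sequences)
    return hits >= q * len(sequences)


if __name__ == "__main__":
    seqs = ["ACGTACGT", "TTACCTAA", "GGGGGGGG"]
    print(is_quorum_motif("ACGT", seqs, d=1, q=0.5))
    print(is_quorum_motif("ACGT", seqs, d=1, q=1.0))
```

A brute-force check like this scales with the total sequence length per candidate, which is why practical methods first narrow the candidate set before verifying it.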
Data availability
1. The mESC data were downloaded from https://lgsun.grc.nia.nih.gov/CisFinder/, the web version of CisFinder.
2. For the ENCODE TF ChIP-seq data, Homo sapiens (hg19) datasets were used and retrieved with the following steps:
a. Download the datasets in narrowPeak format from UCSC: http://genome.ucsc.edu/ENCODE/downloads.html.
b. Convert the narrowPeak format to FASTA format.
c. Find the web logo of the TFBSs from the JASPAR database. http://compbio.mit.edu/encode-motif.
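The conversion from narrowPeak to FASTA in the steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the reference genome is already loaded into a dictionary mapping chromosome names to sequence strings (the `genome` argument is hypothetical), and it relies on narrowPeak being a tab-separated BED-style format whose first three columns are chromosome, 0-based start, and exclusive end.

```python
from typing import Dict, Iterable


def narrowpeak_to_fasta(peak_lines: Iterable[str], genome: Dict[str, str]) -> str:
    """Turn narrowPeak records into FASTA text.

    `genome` is an assumed in-memory mapping such as {"chr1": "ACGT..."}.
    Coordinates are 0-based and half-open, which matches Python slicing.
    """
    records = []
    for line in peak_lines:
        line = line.strip()
        # Skip blank lines and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if chrom not in genome:
            continue  # chromosome not available in the supplied genome
        sequence = genome[chrom][start:end]
        records.append(f">{chrom}:{start}-{end}\n{sequence}")
    return "\n".join(records)


if __name__ == "__main__":
    toy_genome = {"chr1": "ACGTACGTAC"}
    toy_peaks = ["chr1\t2\t6\tpeak1\t100\t.\t5.0\t-1\t-1\t2"]
    print(narrowpeak_to_fasta(toy_peaks, toy_genome))
```

For whole-genome data, a dedicated tool that reads an indexed genome file is usually a better fit than holding every chromosome in memory.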
Funding
No funding is involved.
Ethics declarations
Conflict of interest
The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.
Ethical approval
All the authors mentioned in the manuscript have agreed to authorship, read and approved the manuscript, and given consent for submission and subsequent publication of the manuscript. The order of authorship was agreed upon by all named authors prior to submission. Full names, institutional affiliations, highest degrees obtained by the authors, and e-mail addresses are clearly mentioned on the title page. A corresponding author, who takes full ownership of all communication related to the manuscript, is designated, and his/her detailed institutional affiliation is provided. Manuscript submission-related declarations: The manuscript, in part or in full, has not been submitted or published anywhere. The manuscript will not be submitted elsewhere until the editorial process is completed. Statements of ethical approval for studies involving human subjects and/or animals: This article does not involve human subjects and/or animals.
Informed consent
For this type of study, informed consent is not required.
About this article
Cite this article
Theepalakshmi, P., Reddy, U.S. A new efficient quorum planted (ℓ, d) motif search on ChIP-seq dataset using segmentation to filtration and freezing firefly algorithms. Soft Comput 28, 3049–3070 (2024). https://doi.org/10.1007/s00500-023-09236-z
DOI: https://doi.org/10.1007/s00500-023-09236-z