An efficient algorithm for identifying (ℓ, d) motif from huge DNA datasets

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Discovering transcription factor binding sites (TFBS) is of great importance for developing techniques and for evaluating regulatory processes in biological systems. DNA gene sequence data sets are very large, so a new methodology is needed to analyze them as quickly as possible. Over the past decades, planted (ℓ, d) motif discovery has been used to locate TFBS in genetic regions. This paper develops a new planted (ℓ, d) motif discovery algorithm named ESMD (Emerging Substring based Motif Detection), which consists of two processes: mining emerging substrings and combining them. In the mining step, an array is first created from the suffix array (SA) and the longest common prefix (LCP) array. Because DNA gene sequences constitute huge data, the mining of emerging substrings is handled by a MapReduce programming model. The combining step then merges emerging substrings of different lengths. The resulting models were evaluated using two metrics, the Pearson Correlation Coefficient (PCC) and the Area Under the Curve (AUC), and on both they produced much better results than existing methods.
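The abstract names the suffix array and the LCP array as the basis of the mining step but does not show how they are built. The following is a minimal single-machine sketch of those two standard structures, not the ESMD implementation. The function names `build_suffix_array` and `build_lcp_array` are illustrative assumptions. The suffix array is built by naive sorting for clarity, and the LCP array uses the well-known linear-time method of Kasai et al.

```python
def build_suffix_array(text):
    """Return suffix start positions in lexicographic order.

    Naive construction by sorting suffixes; fine for short examples,
    too slow for genome-scale input (an assumption made for clarity).
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text, sa):
    """Kasai-style LCP construction.

    lcp[k] is the length of the longest common prefix of the suffixes
    starting at sa[k - 1] and sa[k]; lcp[0] is defined as 0.
    """
    n = len(text)
    rank = [0] * n
    for pos, start in enumerate(sa):
        rank[start] = pos          # inverse of the suffix array

    lcp = [0] * n
    h = 0                          # current match length, reused across steps
    for i in range(n):
        if rank[i] == 0:
            h = 0                  # first suffix in sorted order has no predecessor
            continue
        j = sa[rank[i] - 1]        # suffix that precedes suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1                 # the next suffix shares at least h - 1 characters
    return lcp


if __name__ == "__main__":
    sa = build_suffix_array("banana")
    print(sa)                             # [5, 3, 1, 0, 4, 2]
    print(build_lcp_array("banana", sa))  # [0, 1, 3, 0, 0, 2]
```

A run of adjacent entries with `lcp[k] >= m` marks suffixes that share a prefix of length at least `m`, which is how repeated substrings can be counted from these arrays without comparing every pair of positions.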

Author information

Corresponding author

Correspondence to Vijayan Sugumaran.

About this article

Cite this article

Masood, M.M.D., Arunarani, A.R., Manjula, D. et al. An efficient algorithm for identifying (ℓ, d) motif from huge DNA datasets. J Ambient Intell Human Comput 12, 485–495 (2021). https://doi.org/10.1007/s12652-020-02013-y

  • DOI: https://doi.org/10.1007/s12652-020-02013-y
