An efficient algorithm for identifying (ℓ, d) motif from huge DNA datasets

Masood, M. Mohamed Divan; Arunarani, A. R.; Manjula, D.; Sugumaran, Vijayan

doi:10.1007/s12652-020-02013-y

Original Research
Published: 01 May 2020

Volume 12, pages 485–495, (2021)
Journal of Ambient Intelligence and Humanized Computing

M. Mohamed Divan Masood¹,
A. R. Arunarani²,
D. Manjula² &
Vijayan Sugumaran³,⁴

Abstract

Discovering Transcription Factor Binding Sites (TFBS) is of great importance for developing techniques and evaluating regulatory processes in biological systems. DNA gene sequences form very large datasets, so a new methodology is needed to analyze them quickly. Over the past decades, planted (ℓ, d) motif discovery has been used to locate TFBS in genetic regions. This paper develops a new approach to motif identification based on a planted (ℓ, d) motif discovery algorithm. The proposed algorithm, ESMD (Emerging Substring based Motif Detection), consists of two processes: mining emerging substrings and combining them. In the mining step, an array is first created from the suffix array (SA) and the longest common prefix (LCP) array. Because DNA gene sequences constitute huge data, the mining of emerging substrings is handled with the MapReduce programming model. The combining step then merges emerging substrings of different lengths. The resulting models are evaluated with two metrics, the Pearson Correlation Coefficient (PCC) and the Area Under Curve (AUC), and on both the authors report much better results than existing methods.
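As a rough illustration of the two building blocks named in the abstract, the sketch below builds a suffix array and its LCP array for a single sequence and checks whether a candidate string occurs in a sequence with at most d mismatches, which is the usual reading of a planted (ℓ, d) motif occurrence. This is not the ESMD implementation, which is not described in this preview: the function names are hypothetical, the suffix array uses a naive sort, the LCP array uses Kasai's algorithm as a stand-in, and the MapReduce and emerging-substring steps are omitted.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction by sorting suffixes; fine for a demo, too slow for
    genome-scale input.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: list[int]) -> list[int]:
    """Kasai's algorithm (an assumed choice, not taken from the paper).

    lcp[k] is the length of the longest common prefix of the suffixes
    starting at sa[k - 1] and sa[k]; lcp[0] is 0 by convention.
    """
    n = len(text)
    rank = [0] * n
    for k, start in enumerate(sa):
        rank[start] = k
    lcp = [0] * n
    h = 0  # current matched length, reused between iterations
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif: str, sequence: str, d: int) -> bool:
    """True if some window of len(motif) in sequence has at most d mismatches."""
    length = len(motif)
    return any(
        hamming(motif, sequence[i:i + length]) <= d
        for i in range(len(sequence) - length + 1)
    )


if __name__ == "__main__":
    seq = "GATTACA"
    sa = suffix_array(seq)
    print(sa, lcp_array(seq, sa))
    print(occurs_within("ACGT", "TTACCTGG", d=1))
```

Adjacent suffix-array entries with a large LCP value share a long prefix, which is what makes the SA/LCP pair useful for counting repeated substrings without building a suffix tree.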

Author information

Authors and Affiliations

Department of Computer Science and Engineering, B S Abdur Rahman Crescent Institute of Science and Technology, Chennai, India
M. Mohamed Divan Masood
Department of Computer Science and Engineering, Anna University, Chennai, India
A. R. Arunarani & D. Manjula
Department of Decision and Information Sciences, Oakland University, Rochester, MI, USA
Vijayan Sugumaran
Center for Data Science and Big Data Analytics, Oakland University, Rochester, MI, USA
Vijayan Sugumaran

Corresponding author

Correspondence to Vijayan Sugumaran.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Masood, M.M.D., Arunarani, A.R., Manjula, D. et al. An efficient algorithm for identifying (ℓ, d) motif from huge DNA datasets. J Ambient Intell Human Comput 12, 485–495 (2021). https://doi.org/10.1007/s12652-020-02013-y

Received: 02 August 2019
Accepted: 17 April 2020
Published: 01 May 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s12652-020-02013-y
