FastFeatGen: Faster Parallel Feature Extraction from Genome Sequences and Efficient Prediction of DNA $N^6$-Methyladenine Sites

Rahman, Md. Khaledur

doi:10.1007/978-3-030-46165-2_5

Md. Khaledur Rahman (ORCID: orcid.org/0000-0002-8784-5406)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12029)

Included in the following conference series:

International Conference on Computational Advances in Bio and Medical Sciences

Abstract

$N^6$-methyladenine is widely found in both prokaryotes and eukaryotes. It is involved in many biological processes, including prokaryotic defense systems and human diseases. It is therefore important to locate it correctly in the genome, where it may play a significant role in different biological functions. A few computational tools exist for this purpose, but they are computationally expensive and leave room to improve accuracy. An informative feature extraction pipeline from genome sequences is at the heart of these tools, as it is for many other bioinformatics tools, but sequential approaches become expensive when the data are large. Hence, a scalable parallel approach is highly desirable. In this paper, we develop a new tool, called FastFeatGen, that emphasizes both a parallel feature extraction technique and improved accuracy using machine learning methods. We implement our feature extraction approach using shared-memory parallelism, which achieves around 10$\times$ speedup over the sequential one. We then employ an exploratory feature selection technique that helps to find more relevant features to feed to machine learning methods. We employ the Extra-Tree Classifier (ETC) in FastFeatGen and perform experiments on rice and mouse genomes. Our experimental results achieve accuracies of 85.57% and 96.64%, respectively, which are better than or competitive with current state-of-the-art methods. Our shared-memory-based tool can also serve queries much faster than the sequential technique. All source code and datasets are available at https://github.com/khaled-rahman/FastFeatGen.
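The idea of extracting features from many sequences in parallel within one shared address space can be sketched as follows. This is a minimal illustration, not the FastFeatGen implementation: the abstract does not say which features are computed or how the work is divided, so the k-mer count features, the per-sequence split, and every name here (`kmer_vocabulary`, `kmer_counts`, `extract_features`) are assumptions made for the example. Threads share the lookup table in memory, which mirrors the shared-memory model, but in CPython the global interpreter lock limits the speedup for pure-Python work, so the sketch shows the decomposition rather than the reported performance.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide alphabet


def kmer_vocabulary(k):
    """Return all 4**k possible k-mers in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_counts(sequence, k, index):
    """Count overlapping k-mers in one sequence.

    Windows containing symbols outside ACGT (for example 'N') are skipped.
    """
    counts = [0] * len(index)
    for i in range(len(sequence) - k + 1):
        col = index.get(sequence[i:i + k])
        if col is not None:
            counts[col] += 1
    return counts


def extract_features(sequences, k=2, workers=4):
    """Build one k-mer count vector per sequence using a thread pool.

    Each sequence is an independent task; all threads read the same
    k-mer -> column lookup table, which lives in shared memory.
    """
    index = {kmer: j for j, kmer in enumerate(kmer_vocabulary(k))}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input sequences
        return list(pool.map(lambda s: kmer_counts(s, k, index), sequences))
```

Calling `extract_features(["ACGTAC", "AAAA"], k=1)` yields one row per sequence with columns ordered A, C, G, T. The resulting matrix is the kind of input that a feature selection step and a tree-ensemble classifier would then consume.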

Notes

1. https://www.ncbi.nlm.nih.gov/geo/

Author information

Authors and Affiliations

Indiana University Bloomington, Bloomington, IN, 47408, USA
Md. Khaledur Rahman

Corresponding author

Correspondence to Md. Khaledur Rahman.

Editor information

Editors and Affiliations

University of Connecticut, Storrs, CT, USA
Ion Măndoiu
Virginia Tech, Blacksburg, VA, USA
T. M. Murali
Florida International University, Miami, FL, USA
Giri Narasimhan
University of Connecticut, Storrs, CT, USA
Sanguthevar Rajasekaran
Georgia State University, Atlanta, GA, USA
Pavel Skums
Georgia State University, Atlanta, GA, USA
Alexander Zelikovsky

About this paper

Cite this paper

Rahman, M.K. (2020). FastFeatGen: Faster Parallel Feature Extraction from Genome Sequences and Efficient Prediction of DNA $N^6$-Methyladenine Sites. In: Măndoiu, I., Murali, T., Narasimhan, G., Rajasekaran, S., Skums, P., Zelikovsky, A. (eds) Computational Advances in Bio and Medical Sciences. ICCABS 2019. Lecture Notes in Computer Science, vol 12029. Springer, Cham. https://doi.org/10.1007/978-3-030-46165-2_5

DOI: https://doi.org/10.1007/978-3-030-46165-2_5
Published: 29 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46164-5
Online ISBN: 978-3-030-46165-2
eBook Packages: Computer Science, Computer Science (R0)
