Extracting Common DNA Segments from the Complete Genomes of 7538 Viruses and Five Selected Mammals

  • Conference paper
Advances in Computational Collective Intelligence (ICCCI 2024)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2165)

Abstract

This paper aims to extract common consecutive DNA sequences that appear in the complete genomes of both viruses and mammals. Treating these common DNA sequences as genomic fossils, biologists and virologists can trace possible pathways of genomic evolution across species. To meet the heavy computational demands of extracting common DNA sequences from the complete genomes of all selected viruses and mammals, this study adopts a previously developed approach based on the MapReduce programming model. The extraction experiments were run on a Hadoop cluster of ten computing nodes. The experimental data comprise the whole genomic sequences of 7,538 viruses and five selected mammals: \(\textit{Homo sapiens}\) (Human), \(\textit{Pan troglodytes}\) (Chimpanzee), \(\textit{Mus musculus}\) (House Mouse), \(\textit{Rattus norvegicus}\) (Brown Rat) and \(\textit{Sus scrofa}\) (Wild Boar). A very large number of common consecutive DNA sequences were extracted; for simplicity, only the 26 whose lengths exceed 50 base pairs (bp) are selected for illustration. Of these 26, 13 were reverified as non-repetitive sequences, which could serve as clues for re-examining the relationship between viruses and mammals. With cloud computing, which can provide more than the ten computing nodes used in this study, it is believed that this approach can handle the complete genomes of more species and thereby provide biologists and virologists with more common genomic fossils for verifying potential connections among diverse species in the future.
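
The idea of finding consecutive sequences shared by viral and mammalian genomes with a map step and a reduce step can be illustrated with a minimal single-machine sketch. This is not the authors' method, which this page does not describe in detail; it assumes fixed-length windows (k-mers) instead of variable-length segments, and the function names, the class labels "virus" and "mammal", and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def map_kmers(label, sequence, k):
    """Map step (illustrative): emit (k-mer, class label) for every window of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], label


def reduce_common(pairs, required_labels):
    """Reduce step (illustrative): keep k-mers observed under every required label."""
    labels_by_kmer = defaultdict(set)
    for kmer, label in pairs:
        labels_by_kmer[kmer].add(label)
    return sorted(
        kmer for kmer, labels in labels_by_kmer.items()
        if required_labels <= labels
    )


def common_segments(genomes, k):
    """Run the map and reduce steps in memory over (label, sequence) pairs."""
    pairs = []
    for label, sequence in genomes:
        pairs.extend(map_kmers(label, sequence, k))
    return reduce_common(pairs, {label for label, _ in genomes})


# Toy, made-up sequences; real genomes are far longer.
toy_genomes = [("virus", "ACGTTGCA"), ("mammal", "GGACGTTA")]
print(common_segments(toy_genomes, 5))  # ['ACGTT']
```

On a real cluster the emitted pairs would be grouped by key across nodes rather than held in one dictionary, which is what makes the computation scale with the number of computing nodes.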

Acknowledgement

The project is funded in part by the Ministry of Science and Technology (MOST) under Grant No. 107-2632-E-468-002. Thanks to Prof. Tsung-Chi Chen for discussing plant viruses, and to Mr. Ren-Der Huang for maintaining the Hadoop cluster computing environment.

Author information

Corresponding author

Correspondence to Jing-Doo Wang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, JD., Wang, YC. (2024). Extracting Common DNA Segments from the Complete Genomes of 7538 Viruses and Five Selected Mammals. In: Nguyen, NT., et al. Advances in Computational Collective Intelligence. ICCCI 2024. Communications in Computer and Information Science, vol 2165. Springer, Cham. https://doi.org/10.1007/978-3-031-70248-8_29

  • DOI: https://doi.org/10.1007/978-3-031-70248-8_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70247-1

  • Online ISBN: 978-3-031-70248-8

  • eBook Packages: Computer Science, Computer Science (R0)
