Abstract
This paper aims to extract common consecutive DNA sequences that appear in the complete genomes of both viruses and mammals. With these common DNA sequences serving as genomic fossils, biologists and virologists can trace possible pathways of genomic evolution across species. To meet the heavy computational demand of extracting common DNA sequences from the complete genomes of all selected viruses and mammals, this study adopts a previously developed approach based on the MapReduce programming model. The extraction experiments were run on a Hadoop cluster of ten computing nodes. The experimental data comprise the whole genomic sequences of 7,538 viruses and five selected mammals: \(\textit{Homo sapiens}\) (Human), \(\textit{Pan troglodytes}\) (Chimpanzee), \(\textit{Mus musculus}\) (House Mouse), \(\textit{Rattus norvegicus}\) (Brown Rat) and \(\textit{Sus scrofa}\) (Wild Boar). A huge number of common consecutive DNA sequences were extracted; for simplicity, only the 26 whose lengths are longer than 50 base pairs (bp) are selected for illustration. Of these 26, 13 were reverified as non-repetitive sequences, which can be regarded as clues for re-examining the relationship between viruses and mammals. With cloud computing, which can provide more computing nodes than the ten used in this study, it is believed that this approach can handle more complete genomes of species and thus provide more common genomic fossils for biologists and virologists to verify potential connections among diverse species in the future.
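As a rough illustration of the map-and-reduce idea behind such an extraction, the following minimal sketch finds fixed-length segments shared by two toy sequences. It is not the approach adopted in the study: the fixed segment length `k`, the function names, and the labels `"virus"` and `"mammal"` are assumptions made for this example only, and the code runs in a single process rather than on a Hadoop cluster.

```python
from collections import defaultdict


def map_kmers(label, sequence, k):
    """Map step (hypothetical helper): emit (segment, source label) pairs."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], label


def reduce_common(pairs, required_labels):
    """Reduce step: keep segments observed under every required label."""
    seen = defaultdict(set)
    for segment, label in pairs:
        seen[segment].add(label)
    return sorted(s for s, labels in seen.items() if required_labels <= labels)


def common_kmers(virus_seq, mammal_seq, k):
    """Return the length-k segments present in both toy sequences."""
    pairs = list(map_kmers("virus", virus_seq, k))
    pairs += list(map_kmers("mammal", mammal_seq, k))
    return reduce_common(pairs, {"virus", "mammal"})


if __name__ == "__main__":
    # Toy inputs, not real genomic data.
    print(common_kmers("ACGTACGG", "TTACGTAA", 4))
```

In a real MapReduce job the framework would group the emitted pairs by segment between the two steps; here the dictionary in `reduce_common` plays that role.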
Acknowledgement
The project is funded in part by the Ministry of Science and Technology (MOST) under Grant No. 107-2632-E-468-002. Thanks to Prof. Tsung-Chi Chen for discussing plant viruses, and to Mr. Ren-Der Huang for maintaining the Hadoop cluster computing environment.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, JD., Wang, YC. (2024). Extracting Common DNA Segments from the Complete Genomes of 7538 Viruses and Five Selected Mammals. In: Nguyen, NT., et al. Advances in Computational Collective Intelligence. ICCCI 2024. Communications in Computer and Information Science, vol 2165. Springer, Cham. https://doi.org/10.1007/978-3-031-70248-8_29
DOI: https://doi.org/10.1007/978-3-031-70248-8_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70247-1
Online ISBN: 978-3-031-70248-8
eBook Packages: Computer Science, Computer Science (R0)