Comparing Reverse Complementary Genomic Words Based on Their Distance Distributions and Frequencies
In this work, we study reverse complementary genomic word pairs in human DNA by comparing both the distance distribution and the frequency of a word with those of its reverse complement. We consider several measures of dissimilarity between distance distributions and find that the peak dissimilarity works best in this setting. We report reverse complementary word pairs with very dissimilar distance distributions, as well as word pairs whose distance distributions are very similar even though both are irregular and contain strong peaks. We also explore the association between distribution dissimilarity and frequency discrepancy, and speculate that symmetric pairs combining a low value of one measure with a high value of the other may uncover features of interest. Taken together, our results suggest that some asymmetries in the human genome go far beyond Chargaff’s rules. The study uses both the complete human genome and its repeat-masked version.
Keywords: Chargaff’s rules; Human genome; Distance distribution; Peak dissimilarity; Symmetric word pairs
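The comparison described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in this section: the distance distribution of a word is taken to be the relative frequency of the gaps between consecutive (possibly overlapping) occurrences of that word, the sequence is a plain string over A, C, G, T, and the total variation distance is used as a stand-in dissimilarity, because the peak dissimilarity is not defined here. All function names are hypothetical.

```python
from collections import Counter

# Complement table for the four DNA bases (assumes upper-case A, C, G, T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a DNA word, e.g. AAC -> GTT."""
    return word.translate(_COMPLEMENT)[::-1]


def occurrences(seq, word):
    """Start positions of all occurrences of word in seq, overlaps included."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions


def distance_distribution(seq, word):
    """Relative frequency of gaps between consecutive occurrences of word.

    Assumption: this is the notion of 'distance distribution' intended;
    the section itself does not define it.
    """
    pos = occurrences(seq, word)
    gaps = Counter(b - a for a, b in zip(pos, pos[1:]))
    total = sum(gaps.values())
    return {d: c / total for d, c in gaps.items()} if total else {}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    Stand-in only: this is NOT the peak dissimilarity used in the paper.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def compare_pair(seq, word):
    """Compare a word with its reverse complement on one sequence."""
    rc = reverse_complement(word)
    return {
        "word": word,
        "reverse_complement": rc,
        "count_word": len(occurrences(seq, word)),
        "count_rc": len(occurrences(seq, rc)),
        "dissimilarity": total_variation(
            distance_distribution(seq, word),
            distance_distribution(seq, rc),
        ),
    }
```

Calling `compare_pair(sequence, "AAC")` on a chromosome string would return the two occurrence counts together with the stand-in dissimilarity. A real analysis would also have to decide how to treat unknown bases and masked regions, which this sketch ignores.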
This work was partially supported by the Portuguese Foundation for Science and Technology (FCT), Center for Research and Development in Mathematics and Applications (CIDMA), Institute of Biomedicine (iBiMED) and Institute of Electronics and Telematics Engineering of Aveiro (IEETA), within projects UID/MAT/04106/2013, UID/BIM/04501/2013 and UID/CEC/00127/2013. A. Tavares acknowledges the Ph.D. Grant PD/BD/105729/2014 from the FCT. The research of P. Brito was financed by the ERDF—European Regional Development Fund through the Operational Programme for Competitiveness and Internationalization—COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by the FCT as part of project UID/EEA/50014/2013. The research of J. Raymaekers and P. J. Rousseeuw was supported by projects of Internal Funds KU Leuven.